reviewed journal publishing high-impact research which contributes new results and theoretical
professionals. It especially provides a platform for high-caliber researchers, practitioners and
computer science; IT security, Mobile computing, Cryptography, Software engineering, Wireless
sensor networks etc. This scholarly resource endeavors to provide international audiences with
to IJCSIS and researchers for continued support by citing papers published in IJCSIS. Without
their sustained and unselfish commitments, IJCSIS would not have achieved its current premier 
status. 

“We support researchers to succeed by providing high visibility & impact value, prestige and
excellence in research publication.”

1. Paper 30111508: Student Performance Analysis using Large-Scale Machine Learning (pp. 1-6)

Abdelmajid Chaffai, Abdeljalil El Abdouli, Houda Anoun, Larbi Hassouni, Khalid Rifi
RITM LAB ESTC, CED ENSEM, Hassan II University Casablanca, Morocco

Abstract — The massive data generated by information systems in educational institutes need a special analysis in 
terms of efficient storage management and fast processing analytics, in order to gain deeper insights from the 
available data and extract useful knowledge to support decision making and improve the education service. In this 
work, we propose an intelligent system capable of predicting first-year performance in the departments chosen by a
newly enrolled student and recommending a suitable department. The proposed system will help educational
decision-makers reduce the failure rate by directing new students to departments where they can succeed, and
improve enrollment management. Our approach applies machine learning techniques built on the
MapReduce distributed computing model and the HDFS storage model. Decision trees and Support Vector Machines
are used to build the prediction model, and clustering is used to develop the recommender system. To design our system,
we relied on a real case study: the results of students enrolled in the High School of Technology of
Casablanca (ESTC), collected between 2005 and 2014. The dataset is composed of two large data sources: the
pre-higher education dataset and the result dataset of the first study year in the high school. In experiments,
the results generated by our system are very promising, and the machine learning infrastructure implemented can be
used for future analytics on a large variety of data sources.

Keywords: Machine learning; MapReduce; HDFS; Decision tree; Support Vector Machines; Prediction;
Clustering; Recommender system.



2. Paper 30111513: Leveraging DEMF to Ensure and Represent 5Ws&1H in Digital Forensic Domain (pp. 7-10)

Jasmin Cosic, ICT Section of Police Administration, Ministry of the Interior of Una-Sana Canton, Bihac, Bosnia and
Herzegovina

Miroslav Baca, Faculty of Organization and Informatics, University of Zagreb, Varazdin, Croatia 

Abstract - In this paper, the authors discuss the metadata of digital evidence, the digital chain of custody, and a
framework for digital evidence management. A concrete implementation of DEMF (Digital Evidence Management
Framework) is presented. DEMF is a framework that was developed at the conceptual (ontology) level and is now
implemented in Java. A case study of a real criminal case processed through DEMF is also presented.

Keywords - Digital evidence management framework, DEMF, Interoperability, Digital evidence, Chain of custody



3. Paper 30111525: GCTrie for Efficient Querying in Cheminformatics (pp. 11-16) 

Yu Wai Hlaing, University of Computer Studies, Yangon, Myanmar 

Kyaw May Oo, Faculty of Computing, University of Information Technology, Yangon, Myanmar 

Abstract — The field of graph indexing and query processing has received a lot of attention due to the constantly 
increasing usage of graph data structures for representing data in different application domains. Supporting efficient
querying is a key issue in all graph-based applications. In this paper, we propose an index trie structure
(GCTrie) that is constructed from our proposed graph representative structure, called graph code. The proposed
GCTrie can support all types of graph query. Here we focus on index construction and on subgraph and
supergraph query processing. The experimental results and comparisons support the proposed
approach.

Keywords - Graph indexing and querying; Graph representative structure; Index; Subgraph query; Supergraph 
query 



4. Paper 30111532: An Efficient Migration Scheduling Algorithm Based on Energy Consumption and 
Deadline in Cloud Computing (pp. 17-23) 

Sogol Davati, Department of Computer Engineering, South Tehran Branch, Islamic Azad University, Tehran, Iran 
Ahmad Khademzadeh, Education and National International Scientific Cooperation Department, Iran 
Telecommunication Research Center (ITRC), Tehran, Iran 

Abstract - The cloud is one of the most important shareable environments, to which many customers connect
in order to access services and products. In this environment all resources are available in an integrated
manner and several users can send their requests simultaneously. Under these conditions, methods are required
to schedule requests efficiently. Important cases arise when a cloud server is overloaded or a user request has
a deadline constraint. In these situations, migration is performed to provide efficient service. Migration of virtual
machines can be used as a tool for achieving balance. In other words, when the load of the virtual machines on a physical
machine increases, some of the virtual machines migrate to another physical machine so that the virtual machines do
not face any shortage of capacity. Virtual machine migration is used for many reasons, such as reducing power
consumption, load balancing among virtual machines, online maintenance and the ability to respond to requests with
deadline constraints. In this work, we present an efficient algorithm for scheduling live migration of virtual machines
that run applications with deadline constraints. The scheduling uses thresholds for migrating virtual machines in
order to guarantee the service level agreement (SLA). Simulations show that the proposed method performs better in terms of
the number of successfully served requests and the level of energy consumption.

Keywords- Cloud computing; Virtual machine migration; Threshold; CloudSim; SLA; Scheduling



5. Paper 30111533: EEG Discrimination of Rats under Different Architectural Environments using ANNs 
(pp. 24-31) 

Mohamed I. El-Gohary, Department of Physics, Al-Azhar University, Cairo, Egypt
Tamer A. A. Al-Zohairy, Department of Computer Science, Al-Azhar University, Cairo, Egypt
Amir M. M. Eissa, Department of Physics, Al-Azhar University, Cairo, Egypt
Sally M. Eldeghaidy, Department of Physics, Suez Canal University, Cairo, Egypt
Hussein M. H. Abd El-Hafez, Department of Physics, Al-Azhar University, Cairo, Egypt

Abstract — The present work introduces a new method for discriminating electroencephalogram (EEG) power 
spectra of the brains of rats housed in different architectural shapes. The ability of neural networks to discriminate is
used to describe the effect of different environments on brain activity. In this research the rats were divided into four
groups according to the shape of their environment: control (normal cage), pyramidal, inverted pyramidal and
circular. Brain activity (EEG) was recorded from the rats of each group. Fast Fourier Transform (FFT) analysis of
the EEG signals was carried out to obtain power spectra. Two different neural networks are used as classifiers for the power
spectra of the four groups: a multi-layer perceptron (MLP) with backpropagation and a radial basis function (RBF)
network with the unsupervised K-means clustering algorithm. Experimental studies have shown that the proposed
algorithms give good results when applied and tested on the four groups. The multi-layer perceptron with backpropagation and
the radial basis function network achieved performance rates reaching 94.4% and 96.67% respectively.

Keywords: EEG, Architectural shape, Artificial neural networks, Power spectrum, Backpropagation Algorithm, 
Radial basis function network. 



6. Paper 30111534: Forward Error Correction for Storage Media: An Overview (pp. 32-40) 



Kilavo Hassan, Nelson Mandela African Institution of Science and Technology, School of Computational and 
Communication Science and Engineering, P.O. Box 447 Arusha, Tanzania 

Kisangiri Michael, Nelson Mandela African Institution of Science and Technology, School of Computational and
Communication Science and Engineering, P.O. Box 447 Arusha, Tanzania

Salehe I. Mrutu, The University of Dodoma, College of Informatics and Virtual Education, P.O. Box 490 Dodoma 

Abstract — As the adoption of Information and Communication Technology (ICT) tools in production and service 
rendering sectors increases, the demand for digital data storage with large storage capacity also increases. Higher
reliability and fault tolerance of storage media systems are among the key requirements that existing systems sometimes
fail to meet, resulting in data loss. Forward error correction (FEC) is one of the techniques applied to
reduce the impact of the data loss problem in digital data storage. This paper presents a survey conducted in different
digital data storage companies in Dar es Salaam, Tanzania. Data were collected and analyzed using the Statistical
Package for Social Sciences (SPSS). Secondary data were captured from user and manufacturer technical reports. The survey
revealed that data loss is still a predominant challenge in the digital data storage industry. Therefore, the study
proposes a new storage media FEC model using a locked convolutional encoder with the enhanced NTC Viterbi
decoder.

Index Terms — Storage Media, FEC, NTC, Viterbi, RS 



7. Paper 30111535: Certificate Based Hybrid Authentication for Bring Your Own Device (BYOD) in Wi-Fi 
enabled Environment (pp. 41-47) 

Upasana Raj, Information Security and Cyber Forensics, Department of Information Technology, SRM University, 
Chennai, India 

Monica Catherine S, Information Security and Cyber Forensics, Department of Information Technology, SRM 
University, Chennai, India 

Abstract - Adopting the strategy of ‘Consumerization of IT’ not only saves organizations money and
increases business agility, but also improves employee productivity and satisfaction, lowers IT procurement and support
costs, and improves collaboration. Organizations have started to develop “Bring Your Own Device” (BYOD)
policies to allow employees to use their own devices in the workplace. It is a hard trend that will not only
continue but will accelerate in the coming years. In this paper we focus on the potential attacks on
BYOD devices when authenticating to a Wi-Fi network. We also enumerate the authentication protocols and methods. A
stringent, indigenous hybrid authentication scheme for a device operating under a BYOD strategy in a Wi-Fi
enabled environment is proposed at the end of the paper.

Keywords — Bring Your Own Device (BYOD), Wi-Fi, Authentication, Authorization, Certificate. 



8. Paper 30111536: Impact of Spectrum Sensing and Primary User Activity on MAC Delay in Cognitive 
Radio Networks (pp. 48-52) 

Elham Shahab, Department of Information Technology Engineering, Nooretouba Virtual University of Tehran, 
Tehran, Iran 

Abstract — Cognitive radio (CR) technology is used in wireless networks with the aim of handling the wireless 
spectrum scarcity. On the other hand, using CR in wireless networks introduces new challenges and performance
overheads that should be resolved. It is crucial to study the effects of the unique characteristics of cognitive radio
on the various protocols. In this paper, a simulation-based performance evaluation is presented in terms of the
MAC delay of a CR user. The effects of the spectrum sensing parameters (sensing duration and frequency), the primary
users’ activity and the number of spectrum channels on the delay of the MAC layer are investigated. The growth and
decay rates of the MAC delay are studied in more detail through various NS2-based simulations. The results give
some useful insights for formulating the MAC delay based on the parameters unique to CR.



Keywords - Cognitive radio; MAC; Delay; Primary user; Spectrum sensing



9. Paper 30111542: A Survey of Receiver Designs for Cooperative Diversity In The Presence of Frequency 
Offset (pp. 53-58) 

Sylvia Ong Ai Ling, Hushairi Zen, Al-Khalid b Hj Othman, Mahmood Adnan 

Faculty of Engineering, University Malaysia Sarawak, Kota Samarahan, Kuching, Sarawak, Malaysia 

Abstract — Cooperative diversity is becoming a potential solution for future wireless communication networks due 
to its capability to form virtual antenna arrays for each node (i.e. user). In cooperative networks, the nodes are able 
to relay the information between the source and the desired destination. However, the performance of the networks 
(for instance - mobile networks, ad-hoc networks and vehicular networks) is generally affected by the mobility of 
the nodes. As the nodes’ mobility rapidly increases, the networks are subjected to frequency offset and unknown
channel properties of the communication links, which degrade the system’s performance. In a practical scenario, it
is challenging and impractical for the relay and destination to estimate the frequency offset and channel
coefficients, especially in a time-varying environment. In this manuscript, a comprehensive survey of the existing literature
on receiver designs based on Double Differential (DD) transmission and the Multiple Symbol Detection (MSD)
approach, which eliminate the need for complex channel and frequency offset estimation, is presented.

Index Terms — Cooperative Diversity, Double Differential, Frequency Offset, Multiple Symbol Differential 
Detection. 



10. Paper 30111544: Measuring and Predicting the Impacts of Business Rules Using Fuzzy Logic (pp. 59-64) 

Mohamed Osman Hegazi 

Department of Computer Science, College of Computer Engineering and Science, Prince Sattam University, Saudi 
Arabia 

Department of Computer Science, College of Computer Science and IT , Alzaiem Alazhari University Sudan 

Abstract - The relations between the constituent elements of the predominant activities may not be measurable
with exact measurements, possibly because of conflicts between such rules or their interaction with other factors;
accordingly, a measurement that accounts for uncertainty is needed. This paper presents a measurement and predictive fuzzy model that
is suited to such rules. The proposed model measures and predicts the impacts of such rules using fuzzy
logic; it is designed based on the concepts of fuzzy logic, fuzzy sets and production rules. The model transforms an
if-then business rule into a weighted fuzzy production rule, and then uses this production rule for predicting and
measuring the impact of the business rule. The model was tested using real data and provides considerable results.

Keywords: Fuzzy logic, Fuzzy set, Production rule, Business rule



11. Paper 30111549: Blind User Visualization and Interaction in Touch Screen: A Designer Perspective (pp. 
65-72) 

Abdelrahman H. Hussein, College of Computer and Engineering Science, University of Hail, Hail, Saudi Arabia 

Abstract - In this paper, we describe how blind students view an external system, using an image map as a case
study. We propose two interaction techniques which allow blind students to discover different parts of the system
by interacting with a touch screen interface. An evaluation of our techniques reveals that 1) building an internal
visualization, interaction technique and metadata of the external structure plays a vital role; 2) blind students prefer
the system to be designed based upon their behavioural model so that they can easily access and build the visualization on their
own; and 3) for the internal visualization to be an exact replica, its metadata must be provided either
through audio cues or by a domain expert (educator). The participants were novice touch screen users, but they had
sufficient experience with desktop computers using screen readers. The implications of this study for the research
questions are discussed.



Keywords- Blind; Visualization; Touch Screen; Accessibility; Usability; Image Map. 



12. Paper 30111554: An Implementation of Android-based Steganography Application (pp. 73-76) 

Sultan Zavrak, Dept. of Computer Engineering, Duzce University, Duzce, Turkey
Seyhmus Yilmaz, Dept. of Computer Engineering, Duzce University, Duzce, Turkey
Huseyin Bodur, Dept. of Computer Engineering, Duzce University, Duzce, Turkey

Abstract - The rapid development of smart phone technology has led to cheaper phones equipped with
many advanced features such as sensors. One of the most widely used sensors in the phone is its camera. Although
the photographs captured by the camera can be shared in many ways, one of the most commonly used sharing methods
is the Multimedia Message Service (MMS), which allows transmission of files such as photographs, audio and video. A
major disadvantage of MMS is that it does not provide a sufficient security mechanism; because of this, the data of
people who want to hide confidential information from state-controlled systems can be easily monitored. In this
study, a steganography-based (rather than cryptography-based) mobile application is developed that can embed
confidential information into an image, send it to the receiver, and extract the confidential information from the image on the receiver
side. In addition, performance data such as the embedding and extraction times of the confidential
information and the experimental results of the application are given.

Keywords — Multimedia Message Service, MMS, Steganography, Smart Phone. 



13. Paper 30111558: Survey of the Adaptive QoS-aware Discovery Approaches for SOA (pp. 77-81)

Monika Sikri, Cisco Systems India Pvt. Ltd. SEZ, Embassy Tech Village, Panathur Devarabeesanahalli, Bangalore 
East Taluk Bangalore India 

Abstract — Service Oriented Architecture is very commonly used as an architectural paradigm to model distributed 
integration needs. It provides the means of achieving organizational agility by building applications which adapt to
the dynamic needs of the business. Agility to adapt is one of the key drivers for its growth. Its widespread adoption
has led to the proliferation of multiple services offering similar functionality but different Quality of Service (QoS) on the
enterprise network. In a real-time enterprise environment, services are added to and removed from the network on the fly.
The service discovery approach needs to consider not only the QoS of other similar services but also the impact
of dynamic and unpredictable system behavior on QoS. In view of this, there is a need for an adaptive discovery
approach that can accommodate these run-time changes and keep the system functioning despite the
unpredictability. As part of this work we have reviewed existing works on adaptive QoS-aware discovery for SOA
systems to understand the gaps and the need for future research.



14. Paper 30111563: M-Learning for Blind Students Using Touch Screen Mobile Apps Case Study - Special 
Education in Hail (pp. 82-88) 

Abdelrahman H. Hussein (1), Majed M. AlHaisoni (1), Ashraf A. Bany Mohammed (2), Mohammed Fakrudeen (3)

(1) College of Computer Sciences and Engineering, University of Hail, Hail, Kingdom of Saudi Arabia 

(2) School of Business, The University of Jordan, Amman, Jordan 

(3) Dept. of Computing and Technology, Anglia Ruskin University, Chelmsford, United Kingdom

Abstract - The relative newness of the touch-screen (TS) based device creates a phenomenon unique and unstudied 
in the academic environment with regard to blind students dependent on Braille. This qualitative research study 
explores how the use of a multi-modal touch-screen based device affects the academic environment for totally blind 
students using YouTube videos. The pilot program (android app) included a prototype for the English course offered 
to fifth grade level pupils attending primary school in Hail, KSA. Data collected from students through a survey and 
focus group interviews and from the faculty through individual interviews was coded and organized according to the 
research questions. The analysis of findings was organized by way of the study’s conceptual framework: (a) substitution of
Braille course materials with YouTube video lessons and (b) accessibility and usability of the developed prototype.
The findings indicate that the majority of students in this study perceived YouTube course materials on a
touch-screen based device (using the Android app) to be as good as, or better than, Braille course materials; that the multi-modal
functionality of the touch-screen based device augmented personal study and classroom learning; and that personal
use contributed positively to academic use of the device.

Keywords- Accessibility; Usability; Touch screen; M-learning; YouTube videos; Blind students



15. Paper 30111506: A Fitness-Gossip Routing Protocol for Saving Energy in Wireless Sensor Networks (pp. 
89-95) 

Kareem Radi Hassan, Department of Computer Science, University of Basrah, Basrah, Iraq

Abstract - Gossiping is a traditional routing scheme commonly used in WSNs to transport data in a multi-hop manner
because of limited radio range and energy constraints. The Gossiping protocol is simple and does not involve additional
devices, but it suffers from several deficiencies when used in WSNs. This paper describes an efficient
technique for transmitting a message to the sink node. The new protocol, called the Fitness-Gossiping
Protocol (FGP), is a modification of the Gossiping protocol that uses a fitness function to select the
optimum next node; once the optimum next node is selected, the data is transmitted to it. We discuss
how the new approach saves network energy and maximizes the network lifetime in comparison
with its counterparts. At the same time, the Fitness-Gossiping protocol balances the energy between nodes.

Index Terms - Gossiping, Fitness-Gossip, Network lifetime, routing, Wireless sensor networks (WSNs).



16. Paper 30111528: Computing the Dynamic Reconfiguration Time in Component Based Real-Time 
Software (pp. 96-104) 

Asghar Farhadi, Computer Department, Islamic Azad University, Arak Branch, Arak, Iran
Mehran Sharafi, Computer Department, Islamic Azad University, Najafabad Branch, Najafabad, Iran

Abstract - New microcontrollers with enhanced capabilities are being developed for applications in more complex scenarios.
However, volatile and ever-changing environments require long-lived systems that remain compatible with
new conditions. Dynamic reconfiguration is a powerful mechanism for implementing an adaptation
strategy. Real-time control systems pose one of the challenges for implementing dynamic reconfiguration
software. In previous work, Adapt.NET was adapted as a framework for implementing component-based
applications. Here, a new web-based test application that completes reconfiguration within a limited time is proposed. The application
dynamically reconfigures the client-side components for compatibility when component parts fail. In this
article the timing behavior of the implemented dynamic reconfiguration algorithm is analyzed. The way
hybrid component-based applications adapt when environmental conditions change is described as well. In
order to predict the implementation time, the behavior of the dynamic reconfiguration algorithm and the
manner of real-time planning that can adapt to environmental changes are assessed, as well as the relationship between
reconfiguration and the deadline period.

Key words: Dynamic reconfiguration, Blackout, Reconfiguration time, Adaptation, state 



17. Paper 30111556: Intrusion Detection Systems: A Novel Multi-Agent Specification Method (pp. 105-110) 

Eddahmani Said & Rahal ROMADI, University Mohammed V Rabat, Faculté des Sciences de Rabat, Morocco
Bouchaib BOUNABAT, University Mohammed V Rabat, ENSIAS, BP 713, Agdal Rabat, Morocco 

Abstract - Intrusion detection systems play a major role in information systems security. However, the complexity of
a network flow can trigger numerous false alarms, commonly called false positives, or can cause
malicious traffic to go undetected, i.e. false negatives. Our objective in the present paper is to put forward a novel method that
combines several types of monitoring, intrusion and malware detection tools, with the main aim of reducing the rate of
false positives/negatives.

Keywords: malware detection system, Reactive Agent, specification, Formal methods.



18. Paper 30111517: A Brief Overview of Cooperative MAC Protocols in Wireless Networks (pp. 111-116) 

Ferzia Firdousi, Department of Electrical Engineering, National University of Sciences and Technology, 
Rawalpindi, Pakistan 

Abstract - Cooperative diversity is a fairly recent technique. It allows the benefits of space diversity to be 
incorporated into the wireless communications network. Whereas it has found widespread use in network, MAC and 
physical layers, this paper highlights its implementation in the MAC layer as a researchable area. A wide number of 
techniques have been deployed in the MAC layer with a scope for good cooperation for the modern wireless 
networks. While all of them have been found to be useful, a few of them have disadvantages as well. In this paper, 
we present a brief overview of the MAC protocols deployed in the cooperative diversity of wireless communication. 
This survey is compact enough to understand properly but detailed enough to provoke research ideas. 

Keywords - Cooperative Diversity, MAC Protocol, Energy Consumption



19. Paper 30111555: Architectural Framework for Inter-Agency Intelligence Information Sharing Knowledge 
Based System in Botswana (pp. 117-131) 

Ezekiel U Okike, Computer Science Department, University of Botswana, Gaborone, 00000, Botswana 
T. Leburu-Dingalo, Computer Science Department, University of Botswana, Gaborone, 00000, Botswana 

Abstract - The 21st century is witnessing an increasing wave of terrorist and criminal activities. This has made the need
for intelligence-led policing the focus of most national governments, including Botswana's. The main objective of this
paper is to propose an architectural model for intelligence-led policing in Botswana which will provide support for
all intelligence agencies in the country, namely the Botswana Police Service (BPS), Directorate of Intelligence and
Security (DISS), Directorate of Corruption and Economic Crime (DCEC), and Criminal Investigation Department
(CID). The model provides for inter-agency information sharing using appropriate modern technologies in order to
give the agencies access to useful crime data which will enhance their ability to make informed decisions and take
appropriate actions against cyber and economic crimes.

Keywords: Cyber and economic crime, Security, intelligence policing, architectural framework, Knowledge base, 
Information sharing, system model. 



20. Paper 30111518: A Comprehensive Review of Different types of Cooperative MAC Layer Protocols (pp. 
132-141) 

Ferzia Firdousi, Department of Electrical Engineering, National University of Sciences and Technology, Pakistan 

Abstract - For the physical layer, cooperative systems have been devised which aim at increasing the diversity gain
achieved by spatial diversity. This advantage can be mapped onto the MAC layer to achieve increased throughput,
a faster transmission rate, reduced power and energy consumption, and a larger network coverage area. However, in
the race to achieve a good MAC layer protocol, many new problems are created, including redundant use of relay
nodes, increased energy wastage and increased delay. In order to understand the true capabilities of
cooperative communication at the MAC layer, many protocols need to be studied and their drawbacks identified, so
that upcoming research can be improved. This paper consolidates research on the different types of
cooperative MAC protocols by summarizing and analyzing them. The analysis discusses issues such as which
relay node is the optimal one, how a system can be made energy efficient and which methodology should be followed for
forwarding a data packet in case of transmission failure, and explains them in detail to leave room for future, new
and improved research.

Keywords: Cooperative, MAC, Relay selection, Diversity gain, Energy efficiency 

Product and Process Quality criteria: A Case Study in Software Engineering Practice in Botswana (pp. 142-149)

Ezekiel U Okike & Motsomi Rapoo, Computer Science Department, University of Botswana, Gaborone, 00000, 
Botswana 

Abstract - The need to ensure quality in software engineering practice necessitated the introduction of software
measurement and other quality standards introduced by the Software Engineering Institute. The appropriate
introduction of quality standards, however, does not mean that practitioners easily adapt to using them without
assistance from experienced professionals. This study investigates software engineering practice in Botswana in
order to help software companies understand and use software engineering quality measures and standards. Fourteen
software companies were identified, of which 7 indicated interest in participating in this study. The results
indicate that most of the participating companies are yet to satisfy the Capability Maturity Model Integration
(CMMI) 18 key performance areas at the 5 levels. Of the 5 companies which indicated that they are using the
CMM/CMMI standard, only 1 company satisfies 100% of the CMMI requirements. The study reveals the need to launch a
programme to bring CMMI into Botswana and to train software companies in the use of appropriate software metric
tools in order to ensure software quality in Botswana.

Keywords: Software engineering, software measurement, software quality, cohesion, capability maturity model 
integration 



22. Paper 30111553: Trust: Models and Architecture in Cloud Computing (pp. 150-153) 

Usvir Kaur, I. K. Gujral Punjab Technical University, Jalandhar 

Dheerendra Singh, Dept. of Computer Science & Engineering, Shaheed Udham Singh College of Engineering and
Technology, Tangori, Mohali, Punjab, India 

Abstract - In today's era, Cloud Computing has become an emerging technology. Various service providers are
available in the market with different configurations, and it becomes very difficult for users to choose between the various cloud
service providers. Trust plays a vital role in the field of cloud computing services, in that it enables users to choose a
particular cloud service provider from the available services on the basis of how this technology has behaved in the past. This paper
discusses various parameters and a trust model framework. Different trust models for a single web service and the various
parameters they use for calculating trust are reviewed.

Keywords: cloud computing, trust model architecture, security. 



23. Paper 30111557: Towards a mobile payment market: A Comparative Analysis of Host Card Emulation 
and Secure Element (pp. 156-164) 

Pierre E. Abi-Char, Department of Computer Science, American University of the Middle East, Kuwait
Gheorghita Ghinea, Department of Computer Science, Brunel University, Kingston Lane, Uxbridge, Middlesex,
UB8 3PH, United Kingdom

Abstract - The considerable potential for mobile payment adoption shows that businesses are interested in 
increasing the number of electronic transactions, while consumers are attracted to convenient ways of fast and 
accessible banking. Nevertheless, the belief persists that the value of Near Field Communication (NFC) technology 
has not yet been fully recognized, particularly in the consumer marketplace. However, the introduction of the 
Android 4.4 operating system, 'KitKat', has pushed the NFC market towards Android devices with the recently 
proposed Host Card Emulation (HCE) technology. Moreover, there are various debates about the ways in which mobile 
payment processes should be managed. Currently, the most recognized and accepted methods for managing mobile 
payment processes are the traditional Secure Element (SE) approach and Host Card Emulation, which has lately 
become a crucial topic for key industry players. This paper describes the aspects of moving forward with mobile 
wallets. More specifically, it discusses the pros and cons of both approaches and presents a detailed analysis 
centred on the security and adoption issues that these approaches may raise. 

Keywords - Near Field Communication; Secure Element; Host Card Emulation; Mobile transaction. 



24. Paper 30111564: Performance Analysis of Sybil Decline: Attack Detection and Removal Mechanism in 
Social Network (pp. 165-171) 

Deepti S. Sharma, Department of CSE, LKCT, Indore, M.P, India 
Dr. Sanjay Thakur, Department of CSE, LKCT, Indore, M.P, India 

Abstract - A peer-to-peer system involves communication between two directly connected hosts. Because it supports 
an open communication medium, it suffers from various security threats from remote computing elements. To avoid 
these threats, various systems and integrated solutions have been proposed over the last few years. One such 
well-known problem is the Sybil attack, in which a user performs unauthorized or malicious activities by creating 
fake identities over the web. One way to keep such threats away from the system is to design a centralized trusted 
authority system. Among the protection approaches, Sybil Defender, Sybil Limit, Sybil Guard and Sybil Shield are 
some well-known tools. After analyzing these tools and their respective mechanisms, we found that they do not fully 
address the associated trust computation issues. To address these issues, we have suggested a novel approach named 
Sybil Decline. This paper gives a performance evaluation of the suggested approach on several performance 
monitoring parameters. Extensive analysis and a number of experiments are performed to demonstrate the 
authenticity of the results and the effectiveness of the suggested approach. 

Keywords - Sybil, Security, Peer to peer system, trusted authority 



25. Paper 30111561: Cost and Performance Based Comparative Study of Top Cloud Service Providers (pp. 
172-177) 

Abstract - The recent boom in the cloud computing industry has caused a shift in the information technology industry 
and has affected the way information and data are stored and shared within the enterprise. The advent of social 
applications also demands the availability of resources that can be shared with others. Cloud-based 
architecture has made it possible for enterprises to use computation power that was not available in the past. 
This paper compares the top available service providers on the basis of the cost of each computing 
model, and also examines their performance by measuring response time. All of these service providers are 
examined, and the comparison is elaborated on the basis of their available architectures. 

Keywords - Comparison of Cloud Operators, Cloud Computing, Azure, RackSpace, Amazon Web Services 

Abstract - The massive data generated by information systems 
in educational institutes needs special analysis in terms of 
efficient storage management and fast processing analytics, in 
order to gain deeper insights from the available data and extract 
useful knowledge to support decision making and improve the 
education service. In this work, we propose an intelligent system 
capable of predicting the first-year performance of a newly 
enrolled student in the departments he or she has chosen, and of 
recommending the most suitable department. The proposed system 
will help educational decision-makers reduce the failure rate by 
orienting new students towards departments where they can 
succeed, and improve enrollment management. Our approach applies 
machine learning techniques based on the MapReduce distributed 
computing model and the HDFS storage model. Decision tree and 
Support Vector Machines are used to build the prediction model, 
and clustering is used to develop a recommender system. To design 
our system, we relied on a real case study: the results of 
students enrolled in the High School of Technology of Casablanca 
(ESTC), collected between 2005 and 2014. The dataset is 
composed of two large data sources: the pre-higher education 
dataset and the result dataset of the first study year in the high 
school. In our experiments, the results generated by our system 
are very promising, and the machine learning infrastructure 
we implemented can be used for future analytics on a large variety 
of data sources. 

Keywords: machine learning; MapReduce; HDFS; Decision 
tree; Support Vector Machines; prediction; Clustering; 
recommender system 

I. Introduction 

Educational Data Mining (EDM) is defined on the community 
site www.educationaldatamining.or as: "Educational 
Data Mining is an emerging discipline, concerned with 
developing methods for exploring the unique types of data that 
come from educational settings, and using those methods to 
better understand students, and the settings which they learn 
in". Several methods and approaches are used in educational 
data mining research, including data mining, machine learning, 
and exploratory data analysis, in order to understand data and 
model the hidden relationships among data sets. These methods 
include classification, clustering, and association rules. 
Nowadays, computer systems in educational institutes 
produce a growing quantity of data year after year. There 
is great interest in using these massive data to analyze, 
understand and bring out solutions to problems 
that affect the educational system, such as failure in the first 
year, and to support sound decision making. 

Managing and analyzing a huge volume of data in the memory of a 
single machine is not an option; it needs a dedicated infrastructure. We 
implement a hybrid distributed solution for both storage and 
analysis on a cluster. This solution ensures scalable storage 
of educational data and fast analytics processing. 

In this work, we use several machine learning and 
exploratory data analysis techniques to produce an interactive 
dashboard for decision-makers. The dashboard summarizes the 
main characteristics of student data with visual methods, 
predicts the performance of newly enrolled students in the 
first year, and recommends the department where they can obtain 
good results. 

II. Machine Learning 

Machine learning is a field of study that develops 
methods capable of learning from data and turning what is learned 
into action to accomplish a defined task. A formal definition 
given by Tom Mitchell [1] is "A computer program is said to 
learn from experience E with respect to some class of tasks T 
and performance measure P, if its performance at tasks in T, as 
measured by P, improves with experience E". 

Machine learning is divided into three categories depending 
on the type of data available and the task to accomplish. 

a. Supervised learning: the input to the learning algorithm is a 
set of examples or observations whose response or target is 
known. The algorithm models and optimizes the relationship 
between the target attribute and the other attributes from the 
examples in the training data, and the resulting model is then 
applied to new unlabeled data. Supervised machine learning is 
widely used to build predictive models. 

b. Unsupervised learning: the target attribute is not specified 
in the input data given to the learning algorithm. 
Unsupervised learning analyzes unlabeled data to find its 
structure. It is widely used to build descriptive models such 
as clustering, association rules, recommender systems, and 
pattern discovery. 

c. Reinforcement learning: a machine accomplishes a task 
by interacting with a dynamic environment and producing 
actions without human intervention. Reinforcement 
learning is used in many applications, such as autonomous 
cars. 

A. Classification with Decision tree 

A decision tree learner uses a method known as "divide and 
conquer" to build a model, in the form of a tree of logical 
decisions, from a set of attributes. The tree is composed 
of one root, branches, interior nodes and leaf nodes. Each 
interior node represents a test on the value of one input 
attribute, each branch is an outcome of that test, and 
each leaf is a terminal node that represents a target value or class 
label in classification. To build a decision tree, an appropriate 
test is applied to divide the samples into homogeneous subsets 
of similar classes. A perfect test would divide the data into 
segments that each contain a single class; such segments are 
considered pure. 

There are many implementations of decision tree 
algorithms. One of the most recent and best known is the C5.0 
algorithm [2], developed by J. Ross Quinlan as a 
successor to and improvement on his earlier algorithm C4.5 [3]. 

B. Classification with Support Vector Machines 

Based on the work of Corinna Cortes and Vladimir Vapnik 
in 1995 [4], the Support Vector Machine (SVM) model 
represents the examples as points in space. It uses 
boundaries called hyperplanes to separate a set of instances into 
homogeneous groups in d-dimensional space. The hyperplane is 
a line in 2D space and a flat surface in higher-dimensional 
space. The SVM learner attempts to create the greatest separation 
between the classes by defining the Maximum Margin Hyperplane 
(MMH), the hyperplane that lies farthest from the nearest 
training points of each class. The points of each 
class that are closest to the MMH are called the support 
vectors. In real-world applications, the classes are often not 
linearly separable. SVM then uses kernel functions to transform the 
data into a higher-dimensional feature space, in which 
a nonlinear relationship can become linear. 
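As a small illustration of a kernel function, the sketch below evaluates the Gaussian (RBF) kernel in the form exp(-sigma * ||x - y||^2). The function name rbf_kernel and the default sigma = 1.0 are assumptions made for this example; they are not taken from the experiments described later.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel value: exp(-sigma * ||x - y||^2).

    The result acts as a similarity score in an implicit
    higher-dimensional feature space: 1.0 for identical points,
    approaching 0 as the points move apart.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# Hypothetical points, for illustration only.
same = rbf_kernel((1.0, 2.0), (1.0, 2.0))
near = rbf_kernel((0.0, 0.0), (1.0, 0.0))
far = rbf_kernel((0.0, 0.0), (3.0, 4.0))
```

A larger sigma makes the similarity fall off faster with distance, which gives a more flexible decision boundary.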

C. Clustering 

Clustering is an unsupervised machine learning technique 
that divides the data in natural groups known as clusters of 
similar elements; it provides additional and new information 
about data. 

The most often used algorithm is k-means, whose principle is: 

a) Fix the number k of clusters into which the data will be 
partitioned. 

b) Choose k random points as initial cluster centers. 

c) Assign each example to its nearest cluster center 
according to the Euclidean distance function. 

d) Update the centroid of each cluster by calculating the 
mean value of the points assigned to the cluster. 

e) Repeat steps c) and d) until the assignment of points 
to clusters no longer changes. 
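The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work (which relies on R); the function name kmeans, the fixed random seed, the iteration cap and the toy data are assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Choose k random points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center
        # (squared Euclidean distance gives the same ordering).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

Because the initial centers are random, different seeds can give different cluster numbering, and on less clearly separated data they can give different clusters.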



III. Related Work 

Brijesh Kumar Baradwaj and Saurabh Pal [5] used 
classification with Decision tree based on ID3 by selecting a 
sample of 50 Master of computer applications students with 7 
predictors, to evaluate and predict the student’s performance at 
the end of the semester; the knowledge is extracted from 
classification in the form of IF-THEN rules. 

Zlatko J. Kovacic [6] conducted a study by selecting an 
enrollment dataset of 450 students to analyze the influence of 
socio-demographic variables (age, gender, ethnicity, education, 
work status, and disability) and study environment (course 
programme and course block) in predicting the student’s 
success, using four different methods of classification with four 
different classification trees (CHAID, exhaustive CHAID, 
QUEST and CART) and logistic regression. Data analytics 
were performed with SPSS 17 and Statistica 8. 

M. Sindhuja et al. [7] conducted a study on a 
dataset collected from a tutoring system to perform analytics 
based on related attributes such as behavior, attitude and 
relationship. Prediction of good behavior, average attitude and 
good relationship with faculty members and tutors was fitted 
with hierarchical clustering analysis using DBScan clustering. 
Data analytics were performed with the WEKA tool. 

Krina Parmar et al. [8] proposed a framework to predict 
student performance based on attributes such as past results, 
mid-semester results, academic punctuality and performance in 
various tests. Datasets are distributed over servers, and metadata 
are managed by a middleware that allows data mining 
algorithms such as classification, clustering and association 
rules to be applied to the datasets. The final model merges the 
results of each local model built where the data are located. 

IV. Tools 
1) Hadoop Framework 

Apache Hadoop [9] is an open-source software 
framework written in Java. It is designed to 
distribute storage and processing across clusters 
of commodity computers, using a specific computing model 
called MapReduce and a specific storage model called 
HDFS. Hadoop is designed with a very high degree of fault 
tolerance and handles dead nodes through 
replication, which is one of the most powerful characteristics of 
this framework compared with other existing distributed 
systems. 

1.1) HDFS: The Hadoop Distributed File System, inspired 
by the Google File System (GFS), is a scalable distributed file 
system for large distributed data-intensive applications [10]. It 
provides scalable and reliable data storage. HDFS has a 
master/slave architecture composed of two main components: 
the NameNode is the master and the DataNode is the slave. 
The NameNode holds the metadata of the whole file system, 
including permissions, locations of data, and information on 
replicated blocks, and manages client access to files. 
The input data is divided into blocks according to the block size 
set in the configuration files (from 64 MB to 128 MB), and the data 
blocks are placed on DataNodes in an intelligent manner in order to 
optimize network bandwidth. The NameNode orchestrates 
the storage operations that occur on the DataNodes, while the 
DataNodes store data and serve it in response to client requests. 
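To make the block arithmetic concrete, the sketch below counts how many blocks a file occupies and how many block copies the cluster stores once replication is applied. The function names are hypothetical, and the replication factor of 3 is the usual HDFS default, assumed here because the configuration of our cluster is not stated.

```python
MB = 1024 * 1024


def hdfs_block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed for a file; the last block may be partly filled."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    # Ceiling division without floating point.
    return -(-file_size_bytes // block_size_bytes)


def stored_block_replicas(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Total block copies kept on DataNodes (replication=3 is an assumption)."""
    return hdfs_block_count(file_size_bytes, block_size_bytes) * replication


# A hypothetical 300 MB file with 128 MB blocks: two full blocks and one partial block.
blocks = hdfs_block_count(300 * MB)
replicas = stored_block_replicas(300 * MB)
```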

1.2) MapReduce is a programming model and an associated 
implementation for processing and generating large data sets 
[11]. It also has a master/slave architecture and allows 
applications to process large data sets in parallel across 
the cluster. The design of a distributed program is divided into 
two main phases, Map and Reduce. The 
distributed process is managed by the JobTracker daemon on 
the master node, and the map and reduce tasks are 
executed by the TaskTracker daemons on the slave nodes. 
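The two phases can be illustrated with a single-process Python sketch that mimics the map, shuffle and reduce steps. It does not use Hadoop; the helper names and the sample records are hypothetical, and a real job would run the same logic distributed across the cluster.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def dept_mapper(record):
    # Emit the department as key and the mark as value.
    dept, mark = record
    yield dept, mark


def mean_reducer(key, values):
    return sum(values) / len(values)


# Hypothetical (department, first-year average) records.
records = [("GI", 12.0), ("GE", 10.0), ("GI", 14.0)]
averages = run_job(records, dept_mapper, mean_reducer)
```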

2) Hadoop streaming 

The Java MapReduce API is the standard option for 
writing MapReduce programs in Java. To write 
MapReduce jobs in other programming or scripting languages, 
we use a utility called Hadoop Streaming [12]. It is a generic 
API that allows writing and running 
executable scripts that use standard input and output over the 
Hadoop cluster. Hadoop Streaming supports C++, PHP, Perl, 
Shell, Ruby, and R. 

3) R with Hadoop 

Analyzing a large dataset on a single machine with R [13] 
is constrained by the limits of memory size. 
Combining the analytic power of R with the power of a Hadoop 
cluster is an efficient way to extend R's processing 
capabilities to massive data. 

In this work, we use two packages, rmr2 [14] and rhdfs [15], 
developed by Revolution Analytics, to connect the R environment 
with the Hadoop cluster. The rmr2 package allows writing 
MapReduce programs in R and depends on Hadoop 
Streaming; the rhdfs package provides HDFS file management in 
R. The main function of the rmr2 package is mapreduce(), whose 
syntax with arguments is: 

mapreduce(input, output, input.format, output.format, map, 
reduce). 

The input and output arguments are the paths of the input and 
output HDFS data. The input.format and output.format 
arguments are strings, with additional arguments to 
specify the format of the files. 
The map and reduce arguments are the two steps of the 
MapReduce algorithm, implemented in the R language. 

4) Hive 

Hive is data warehouse software developed initially by 
Facebook [16]. It allows users to query and manage large 
datasets stored in a Hadoop distributed cluster using a language 
called Hive Query Language (HQL), which is very similar to 
standard ANSI SQL. Hive provides an abstraction on top of Hadoop: 
it translates ad-hoc queries into one or more MapReduce jobs, 
which are then submitted to the Hadoop cluster for execution, 
and returns the results to the user. Hive uses a metastore service 
to store table metadata in a relational database. By default 
the metastore uses Derby, an embedded Java relational 
database that cannot be used in a multiuser environment, so we use 
a MySQL database as the metastore. Hive provides a service that 
allows any client to submit requests and retrieve results; this 
service is provided by hiveserver2 [17], based on Thrift RPC 
[18]. To connect the R environment with the Hive warehouse, we use 
the Rhive [19] library, which allows accessing, managing and 
processing Hive data warehouse data from R. In this work, we use 
Hive to build tables from the dataset stored in HDFS, and 
integrate visualization tools to perform the exploratory data 
analysis. 


V. Methodology of Experiment 
1) Experiment environment 

We deployed a Hadoop cluster on 11 machines with the 
following configuration: Intel(R) Core(TM) i5-3470 CPU 
3.20GHz (4 CPUs), 8GB memory, 300GB hard disk. One 
machine is designated as master (NameNode and JobTracker); 
the others are workers (DataNodes and TaskTrackers). The R 
environment with the rmr2 and rhdfs packages is installed on all 
nodes. Hive and the Rhive package are deployed on the master. 

The platform architecture is shown in Figure 1. 

Figure 1. Distributed analytics platform architecture for machine learning 
using the R environment 



2) Problem definition and goals 

The High School of Technology of Casablanca (ESTC) has 
five departments: 

a) Computer Engineering (GI) 

b) Electrical Engineering (GE) 

c) Chemistry Process Engineering (GP) 

d) Mechanical Engineering (GM) 

e) Technical Management (TM) 

A student who wants to enroll in the first year of the high 
school has to complete an online form. This form proposes a 
list of up to three departments (among the five) that 
the student can join. The goal of our work is to guide the 
student, according to results obtained in the 
baccalaureat exam (the baccalaureat, or bac, is a certificate 
or diploma awarded at the General or Technical Secondary level in 
the Moroccan education system), towards the department where 
he or she can obtain better results. Admission is based on the 
overall average obtained in this exam. 

The results we use are the marks obtained in mathematics, 
physics-chemistry and French language. We pay attention to 
French because, in pre-higher education, the majority of 
courses are taught in Arabic, which is the mother tongue, 
while in higher education all courses are given in French, so 
mastering French is necessary. The main goals of this data 
analytics problem are as follows: 

a) Predict the result that a new student could obtain in 
each department in which he or she is allowed to enroll. 

b) Recommend the most suitable department to the 
student. 

3) Data requirements and data collection 

Our work uses two data sources related to the problem. 

a) A dataset containing the results obtained by students 
in the baccalaureat exam over the 10 years between 
2005 and 2014. The dataset consists of a csv file 
whose headers are presented in Table I: 

TABLE I. Attributes Description of Dataset Corresponding to 
the Results in Baccalaureat 

Attribute | Description | Type (domain) 
id | Student's identity | Numeric 
year | Year of obtaining the baccalaureat | Integer 
gender | Student's sex | Nominal (binary: female, male) 
academy | Origin region of the student | Nominal (with 16 values (regions)) 
bac_type | Baccalaureat type | Nominal (with 5 values) 
bacAverage | Overall average obtained in the baccalaureat exam | Continuous (from 0 to 20) 
frMark | Mark in French language | Continuous (from 0 to 20) 
mathMark | Mark in mathematics | Continuous (from 0 to 20) 
pcMark | Mark in physics-chemistry | Continuous (from 0 to 20) 

b) Dataset extracted from a database containing all the 
results obtained by the last 10 promotions in the 1st 
year of the high school (ESTC). The dataset consists 
of a csv file whose headers are described in Table II: 



TABLE II. Attributes Description of Dataset Corresponding to 
the Results of the Last 10 Promotions 

Attribute | Description | Type (domain) 
id | Student's identity | Numeric 
year_study | Year of study | Integer 
dept | Department | Nominal (with 5 values (departments)) 
fyAverage | First year average of all results | Continuous (from 0 to 20) 



4) Preprocessing data 

To process the data, we begin by storing the two files 
corresponding to the two datasets presented above in the HDFS 
distributed storage system. Then, we write an R program 
using distributed mode and RHadoop integration to merge the 
two csv files into one csv file named data2005_2014.csv. The 
merge uses the student's id and is performed by the equijoin() 
function. The headers of the file data2005_2014.csv are: id, 
year, gender, academy, bac_type, bacAverage, frMark, 
mathMark, pcMark, dept, fyAverage. 
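The effect of the merge can be illustrated with a small in-memory inner join on the id column. This Python sketch only mimics the result on toy rows; it is not the rmr2 equijoin() call, and the helper name join_on_id and the sample values are hypothetical.

```python
def join_on_id(bac_rows, result_rows, key="id"):
    """Inner join of two lists of dicts on a shared key column."""
    # Index the first-year results by student id.
    by_id = {}
    for row in result_rows:
        by_id.setdefault(row[key], []).append(row)
    merged = []
    for bac in bac_rows:
        # Students without a first-year result are dropped (inner join).
        for result in by_id.get(bac[key], []):
            combined = dict(bac)
            combined.update({k: v for k, v in result.items() if k != key})
            merged.append(combined)
    return merged


# Hypothetical rows using a few of the column names from Tables I and II.
bac = [{"id": 1, "mathMark": 15.0}, {"id": 2, "mathMark": 11.5}]
results = [{"id": 1, "dept": "GI", "fyAverage": 13.2}]
merged_rows = join_on_id(bac, results)
```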

To exploit the resulting file stored in HDFS, we begin by 
writing a MapReduce program in R that uses the 
make.input.format() function, which takes the path of 
data2005_2014.csv as input and translates this csv file into a 
format that can be read by rmr2 functions. Then, we start the 
different data analytics across the Hadoop cluster. 

5) Training models on data 
A. Classification 

To train the classification models, we discretize the continuous 
fyAverage attribute by converting it into a categorical 
attribute according to the following partitioning: 

• Risk: if fyAverage < 12 (to succeed, a student must 
get a global average greater than or equal to 12) 

• Acceptable: if 12 ≤ fyAverage < 14 

• Good: if fyAverage ≥ 14 

The discretized fyAverage is defined as the response variable. 
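The partitioning can be written as a small function. In this sketch the function name is hypothetical, and assigning the boundary values 12 and 14 to the higher class is an assumption based on the rule that a student succeeds with an average greater than or equal to 12.

```python
def performance_class(fy_average):
    """Map a first-year average (0 to 20) to a performance class."""
    if fy_average < 12:
        return "Risk"
    if fy_average < 14:
        return "Acceptable"
    # Assumption: an average of exactly 14 counts as Good.
    return "Good"


# Hypothetical averages, for illustration only.
classes = [performance_class(a) for a in (9.5, 12.0, 13.9, 14.0, 17.2)]
```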

1) Decision tree model 

First, from data2005_2014.csv we create a randomly ordered 
dataset, then divide it into two portions. The first portion, 
containing 90 percent of the data, is used as the training dataset 
to build the classification model. The second, containing 10 
percent, is used as the test dataset to evaluate the performance of 
the model. The C50 package [20] contains an implementation of the 
C5.0 decision tree algorithm, which uses entropy to measure the 
impurity of a set of samples according to their target 
classification. The entropy of a set S of samples is: 

$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$ 

where c is the number of different class levels and p_i is the 
proportion of samples of class level i in S. The information 
gain of an attribute A is the reduction in entropy that can be 
expected if a split is made on the basis of this attribute. It is 
calculated as the difference between the entropy before the 
split, S_1, and the entropy of the partitions resulting from the 
split, S_2: 

$$\mathrm{InfoGain}(A) = \mathrm{Entropy}(S_1) - \mathrm{Entropy}(S_2)$$ 

By maximizing the information gain and minimizing the 
entropy, the attribute with the highest information gain is 
selected to create homogeneous groups. 
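Both quantities are easy to compute by hand for a small sample. In the sketch below, the entropy after the split is taken as the size-weighted sum of the entropies of the partitions, which is the usual convention and an assumption about how the entropy of the partitions is combined; the function names and labels are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(labels, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - after


# Hypothetical labels: a split that separates the two classes perfectly.
labels = ["good", "good", "risk", "risk"]
gain_perfect = information_gain(labels, [["good", "good"], ["risk", "risk"]])
gain_useless = information_gain(labels, [["good", "risk"], ["good", "risk"]])
```

A perfect split removes all uncertainty, so its gain equals the entropy before the split; a split that leaves each partition as mixed as the original set has zero gain.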

We use the C5.0() function inside mapreduce functions with 
default settings. The resulting model is then applied to the test 
dataset by using the predict() function. Table III shows its 
performance evaluation. 



TABLE III. Performance of C5.0 Model With Default Settings 

Model | Accuracy | Kappa statistic | F-measure good | F-measure acceptable | F-measure risk 
C5.0 | 89.3% | 0.84 | 0.92 | 0.82 | 0.92 
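The three measures reported for the model (accuracy, Kappa statistic and per-class F-measure) can all be derived from a confusion matrix. The sketch below shows the standard formulas; the function name and the 2-class matrix are made up for illustration and are not the experimental data.

```python
def evaluate(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(labels)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(confusion[i]) for i in range(k)]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    # Accuracy: share of samples on the diagonal.
    accuracy = sum(confusion[i][i] for i in range(k)) / total
    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # F-measure per class: harmonic mean of precision and recall.
    f_measure = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        precision = tp / col_totals[i] if col_totals[i] else 0.0
        recall = tp / row_totals[i] if row_totals[i] else 0.0
        denom = precision + recall
        f_measure[label] = 2 * precision * recall / denom if denom else 0.0
    return accuracy, kappa, f_measure


# Hypothetical confusion matrix for two classes.
acc, kappa, f = evaluate([[40, 10], [20, 30]], ["good", "risk"])
```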



Adaptive boosting, one of the features of C5.0, is based on the 
work of Rob Schapire and Yoav Freund [21]. The trials parameter 
controls the number of boosting iterations; by default it is 
equal to 1. Setting it to 10, which implies building 10 separate 
decision trees instead of one, improved accuracy from 89.3% to 97%. 
The boosted model performance is shown in Table IV. 

TABLE IV. Performance of Boosted C5.0 Model 

Model | Accuracy | Kappa statistic | F-measure good | F-measure acceptable | F-measure risk 
C5.0 | 97% | 0.94 | 0.97 | 0.90 | 0.97 



2) SVM model 

The SVM learner requires that each data sample be described with 
numeric features. We use 1-of-K coding to convert all 
categorical attributes into numeric data. To produce the training 
dataset and the test dataset, we use the same method as in the 
previous classification with the decision tree. 
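A short sketch of 1-of-K coding follows: each categorical value becomes a vector with a single 1 in the position of its category. The helper name one_of_k and the sample values are hypothetical, and ordering the categories alphabetically is a choice made only for this example.

```python
def one_of_k(values, categories=None):
    """Encode categorical values as 0/1 indicator vectors.

    Returns the encoded rows and the category order used for the columns.
    """
    cats = sorted(set(values)) if categories is None else list(categories)
    rows = [[1 if value == cat else 0 for cat in cats] for value in values]
    return rows, cats


# Hypothetical department values for three students.
encoded, columns = one_of_k(["GI", "GE", "GI"])
```

Passing an explicit categories list keeps the column order identical between the training and test datasets.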

The kernlab package [22] is used to build different SVM models by 
changing the value of the kernel argument in the ksvm() function. 
We start from the simple linear kernel "vanilladot" and move to 
more complex non-linear kernels: "rbfdot" (Radial Basis Function, 
or Gaussian, kernel), "polydot" (polynomial kernel), "tanhdot" 
(hyperbolic tangent kernel), "laplacedot" (Laplacian kernel), and 
"besseldot" (Bessel kernel). 

The model with the Radial Basis Function kernel performs 
better than the models with the other kernels; its results are 
shown in Table V. 

TABLE V. Performance of SVM Model With Radial Basis 
Function Kernel 

Model | Accuracy | Kappa statistic | F-measure good | F-measure average | F-measure risk 
SVM | 80% | 0.69 | 0.90 | 0.63 | 0.85 



3) Conclusion 

Tables IV and V show clearly that the C5.0 model 
obtains better results than the SVM model. It presents 
good prediction performance, with accuracy equal to 97%. To 
exploit the model in the deployment phase, we extract it from 
HDFS to the local file system. 

B. Clustering model 

We worked with the subset of the dataset in which observations 
have the "good" value for the discretized fyAverage attribute. To 
prevent attributes with large values from dominating the others, 
z-score standardization is applied to normalize the continuous 
attributes. 
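As a minimal sketch of z-score standardization, the function below subtracts the mean from each value and divides by the standard deviation. Using the sample standard deviation (n - 1 in the denominator) is an assumption, and the marks are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation: an assumption
    if sigma == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical marks on the 0 to 20 scale.
standardized = z_scores([10.0, 12.0, 14.0])
```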

We use the stats package [23], which contains an 
implementation of the k-means algorithm. 

The kmeans() function is invoked inside mapreduce functions, 
and the resulting model has 5 clusters. Table VI shows the 
distribution of students by department over the five 
clusters; for each cluster, the recommended department is the one 
with the highest percentage. 

TABLE VI. Students' Distribution Over the Five Clusters 

Cluster | GE | GI | GM | GP | TM 
1 | 13.45% | 8% | 67.3% | 4.55% | 6.64% 
2 | 57.08% | 12.5% | 3.96% | 7.71% | 18.7% 
3 | 4.85% | 64.6% | 4.3% | 3.08% | 23% 
4 | 19.62% | 21.5% | 15.8% | 35.2% | 7.76% 
5 | 2.28% | 23.2% | 15.1% | 10.4% | 48.8% 
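The recommendation rule can be read directly off Table VI: for each cluster, take the department with the highest percentage. The sketch below encodes the table values; the dictionary and function names are hypothetical.

```python
# Percentages from Table VI, keyed by cluster number and department code.
CLUSTER_DISTRIBUTION = {
    1: {"GE": 13.45, "GI": 8.0, "GM": 67.3, "GP": 4.55, "TM": 6.64},
    2: {"GE": 57.08, "GI": 12.5, "GM": 3.96, "GP": 7.71, "TM": 18.7},
    3: {"GE": 4.85, "GI": 64.6, "GM": 4.3, "GP": 3.08, "TM": 23.0},
    4: {"GE": 19.62, "GI": 21.5, "GM": 15.8, "GP": 35.2, "TM": 7.76},
    5: {"GE": 2.28, "GI": 23.2, "GM": 15.1, "GP": 10.4, "TM": 48.8},
}


def recommended_department(cluster):
    """Return the department with the highest share of students in the cluster."""
    distribution = CLUSTER_DISTRIBUTION[cluster]
    return max(distribution, key=distribution.get)


recommendations = {c: recommended_department(c) for c in CLUSTER_DISTRIBUTION}
```

Each of the five clusters maps to a different department, so a new student assigned to a cluster receives exactly one recommendation.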



VI. Deployment 

To deploy our models in the production environment, we 
developed a web application using JavaScript libraries, as shown 
in Figure 2 and Figure 3. This application provides a dashboard 
with interactive visualization and offers features that are 
divided into two main parts: 

a) Exploratory data analysis 

• To visualize the data stored in the Hadoop cluster 

• To summarize the main characteristics of students: 
gender proportions, proportions of origins, etc. 

• To visualize the evolution of student performance 
over the years. 

b) Predictive analysis 

The decision-maker submits the dataset of the new 
enrollees, then gets the predicted result for each 
department and the recommended department. 

Figure 2. Marks analysis according to range of year, department and bac type 

Figure 3. Predict result and recommend department 

VII. Conclusion 

The main goal of this work is to help reduce the failure 
rate in the first year of a high school and to improve the 
enrollment management system. To achieve that, we combined R 
and Hadoop to build a platform for analytics on 
massive data collected between 2005 and 2014. We built a 
classification model with the C5.0 algorithm to predict the 
first-year result of new students, and we also built a 
recommendation system using the k-means algorithm to 
recommend the suitable department. We finished by producing 
an interactive dashboard to visualize and deploy the finished 
models in the production environment. 
In the near future, we think that the use of this platform will 
help us to study and analyze more complex data from various 
sources, with the aim of analyzing the real-time interaction between 
students and the environment in which they learn. 

Abstract - In this paper, the authors discuss the metadata of 
digital evidence, the digital chain of custody and a framework for 
digital evidence management. A concretization of DEMF (Digital 
Evidence Management Framework) is presented. DEMF is a 
framework that was developed at the conceptual (ontology) level 
and is now implemented in Java. 

A case study based on a real criminal case processed through DEMF 
is also presented. 

Keywords - digital evidence management framework, DEMF, 
interoperability, digital evidence, chain of custody 



I. Introduction 

The concept of “chain of custody” is not new in the digital forensic 
domain, and many scientific papers and research studies about it 
have been published. The concept was developed in medicine, 
engineering, construction, criminal law, and every other domain 
where it is necessary to have 
a chronological consistency of events with 
accompanying metadata. A similar situation exists in the field 
of digital forensics. Within digital evidence management, 
the chain of custody includes 
documentation of everything: the chronological consistency of 
“events” related to digital evidence. If that chain of custody 
concept is digital, in a digital world, and if, in addition to all 
the metadata (5Ws & 1H), it can assure the integrity of digital 
evidence, that will be a big step forward. 

In this paper, the authors analyse this issue, describe 
DEMF¹ using ontology and OWL, and present one case study 
from real life. In this case study, the authors simulate an 
actual case in which the “first responders” in a particular digital 
investigation found three pieces of digital evidence. These were 
added to a “.demf” container using the DEMF application. After that, 
the complete container, containing the original digital evidence and 
metadata, was encrypted with AES-256 and 
delivered to the next participant in the chain, the investigator. 
He re-analysed the entire case, “processed” the three pieces of 
digital evidence, added another file (new digital evidence) to 
the container, packed everything into a new “.demf” container, 
and secured the container with an AES-256 key. 

¹ DEMF stands for Digital Evidence Management Framework (Cosic & Baca, 
2010) 

A court expert, hired by the court for this case, received the court order together with the container and the key needed for decryption. He checked the files, which in this case are the digital evidence, and compared the hash values, time stamps, geo-information, and other metadata secured through DEMF. After he completed his findings, report and opinion, the complete case was returned to the court for further proceedings.



II. DIGITAL CHAIN OF CUSTODY 

The term "chain of preservation" or "chain of custody", according to [1], refers to a complete audit and control of original evidence material that could potentially be used for legal purposes. According to the US National Institute of Justice (NIJ) [2], chain of custody is a process that maintains and documents the chronological history of evidence. The documentation must include the name and surname and/or initials of the person who collected the evidence, every person or entity who had access to it, the date on which the evidence was collected or changed location, the name of the agency and the case number, and the name of the victim or suspect, all described in detail.

Today, some authors use the term "chain of evidence" rather than "chain of custody". The purpose of testimony about the chain of evidence is to prove that digital evidence has not been changed at any stage of the forensic investigation; it must include documentation of how and where the evidence was collected and how it was transported, analysed and stored. An interruption of the chain of custody leads to suspicion that the evidence was changed, falsified or replaced. It is not enough for the court to know only the exact location of digital evidence; its entire path over the whole period must be recorded. Records of, and strict control over, access to digital evidence must also be kept [3]-[11].

III. ONTOLOGY DESCRIPTION OF DEMF 

For these reasons, the authors [12] [10] developed DEMF (Digital Evidence Management Framework), a conceptual framework that not only allows digital evidence to be followed chronologically through the chain of custody, recording what, who, when, how, why and where came into contact with it, but also preserves the integrity of digital evidence through the use of hash functions and AES-256 encryption.

In their early work [13], the authors presented DEMF using ontology and OWL in order to represent knowledge in the domain of digital forensics and digital evidence and to support "reusability". The ontology was subsequently expanded: in addition to classes, properties and axioms, instances were created, together with rules that enable reasoning and assist those who decide on acceptability (judges, etc.).




Figure 1 UML Class diagram with joining of main activities in DI process 



The DEMF ontology makes it possible to create specific instances with their properties and to use a "reasoner" together with the tuned rules in the ontology to determine accurately which digital evidence is acceptable and which is not. Any digital evidence that meets the 5Ws & 1H requirements will be "formally" acceptable.

For the purpose of case study, we will enter some data from a 
particular case into the DEMF application and then check the 
application outputs: 



The ontology is published online at OntoHub at the following link: http://ontohub.org/demf/DEMF V 1 .owl .

During ontology creation, the authors paid particular attention to reusability and therefore included the following ontologies in DEMF: "Small Scale Digital Devices Ontology", "Digital Evidence Ontology" and "Cyber Forensic Ontology" [14][15][13].



IV. USING DEMF IN REAL ENVIRONMENT



Figure 1 presents a class diagram of typical activities in digital 
investigation process. 

These activities describe the digital investigation process and are essential for our case study.



[Tue Nov 10 10:17:19 CET 2015] User logged in DEMF with ID = JASMIN COSIC from IP address = 31.176.223

[Tue Nov 10 10:17:19 CET 2015] Working directory: C:\JRE_DEMF (x86)

[Tue Nov 10 10:17:25 CET 2015] Case CrimeSceneBihac11102015 loaded in memory.

[Tue Nov 10 10:17:43 CET 2015] Getting a hash of file ..

[Tue Nov 10 10:17:43 CET 2015] Hash file is retrieved.

[Tue Nov 10 10:17:43 CET 2015] Reading a meta-data ..

[Tue Nov 10 10:17:44 CET 2015] Meta-data of file is retrieved =
Name-B.Form_1b_-_PhD2015.doc cp:revision=4 date=2015-06-02T10:52:00Z
Company= Keywords= meta:word-count=643 subject= dc:creator=pc7
extended-properties:Company= meta:print-date=2015-06-02T10:49:00Z
Word-Count=643 dcterms:created=2015-06-02T10:49:00Z
dcterms:modified=2015-06-02T10:52:00Z Last-Modified=2015-06-02T10:52:00Z
title= Last-Save-Date=2015-06-02T10:52:00Z meta:character-count=3668
Template=Normal.dotm meta:save-date=2015-06-02T10:52:00Z
dc:title=[illegible] Application-Name=Microsoft Office Word
modified=2015-06-02T10:52:00Z Edit-Time=1800000000 cp:subject=
Content-Type=application/msword
X-Parsed-By=org.apache.tika.parser.DefaultParser
X-Parsed-By=org.apache.tika.parser.microsoft.OfficeParser creator=pc7
meta:author=pc7 dc:subject=
extended-properties:Application=Microsoft Office Word
meta:creation-date=2015-06-02T10:49:00Z Last-Printed=2015-06-02T10:49:00Z
meta:last-author= Comments= Creation-Date=2015-06-02T10:49:00Z
xmpTPg:NPages=2 Last-Author=[illegible] w:comments=
Character Count=3668 Page-Count=2 Revision-Number=[illegible]
extended-properties:Template=Normal.dotm meta:keyword= Author=pc7
comment= meta:page-count=2



In the log file above we can find the generated metadata that describe the digital evidence and the application session: user ID, IP address, hash value of the digital evidence, place and geo-data of the evidence, and finally a time stamp. The court order (the reason for access) and the procedures are fixed variables that are entered during the first use of the application.

The time stamps used in DEMF come from trusted time-stamping services (obtained from the Internet), and geo-data are also obtained from web services. This is configurable, and up to 5 different servers can be used.

Figure 2 The complete history of access to digital evidence



In the next cycle, on further use of the ".demf" container, if the integrity of digital evidence has been compromised, DEMF raises an alert as follows:



[Tue Nov 10 10:23:21 CET 2015] WRONG HASH.

Existing =
3c55d29dfcda36f97293f4104c64d36a252242bbd554a953d7622be7fce93e4c ,

Current =
f621bb81677242fc19159b9d892e734b593c9ca9db7be8d977791939c2746e55
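
The comparison behind such an alert can be sketched in a few lines. This is only an illustration, not the DEMF implementation: it assumes the stored fingerprint is a SHA-256 digest (the logged values are 64 hexadecimal characters long, which is consistent with SHA-256), and the function names `sha256_of_file` and `verify_evidence` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_evidence(path, recorded_hash):
    """Compare the current digest with the one recorded at acquisition time.

    Assumption: the recorded value is a SHA-256 hex digest.
    Returns (ok, message); the message imitates the alert format in the log.
    """
    current = sha256_of_file(path)
    if current != recorded_hash.lower():
        return False, f"WRONG HASH. Existing = {recorded_hash}, Current = {current}"
    return True, "Hash matches."
```

Any change to the file, even a single appended byte, produces a different digest, so the check fails and the chain of custody records the discrepancy.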



Now, if we define the acceptability of digital evidence as [17]:

OriginalDigitalEvidence(?x) ∧ hasFingerprint(?x, ?fp) ∧
hasHashValue(?x, ?sha2) ∧
hasEvidenceLocationCoordinates(?x, ?gps) ∧
hasEvidenceProcedure(?x, ?proc) ∧
hasAccessReason(?x, ?reas) ∧ hasAccessTime(?x, ?ts)
→ DigitalEvidenceAcceptable(?x)

we can say that such digital evidence is formally acceptable.
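
Read as a conjunction, the rule says that evidence is formally acceptable only when every listed property has a value. A minimal sketch of that reading follows; the `EvidenceRecord` class and its field names are hypothetical and simply mirror the property names in the rule, and an ontology reasoner would evaluate the rule differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    """Hypothetical record; each field mirrors one property of the rule."""
    fingerprint: Optional[str] = None     # hasFingerprint
    hash_value: Optional[str] = None      # hasHashValue
    location: Optional[str] = None        # hasEvidenceLocationCoordinates
    procedure: Optional[str] = None       # hasEvidenceProcedure
    access_reason: Optional[str] = None   # hasAccessReason
    access_time: Optional[str] = None     # hasAccessTime


def is_formally_acceptable(record):
    """True only when every property required by the rule is present."""
    required = (record.fingerprint, record.hash_value, record.location,
                record.procedure, record.access_reason, record.access_time)
    return all(value not in (None, "") for value in required)
```

A record that lacks, for example, the access time fails the check, which corresponds to a gap in the chain of custody.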



Since digital evidence, like any other digital file, has a life cycle [16], every new access to the container that DEMF generates (.demf files) must preserve the complete 5Ws & 1H history. At the end of the investigation process, when the evidence must be presented to the court, investigators will have an uninterrupted and preserved chain of custody. Figure 2 shows the whole access history of the digital evidence in a particular case.



Why formally acceptable? Because it is ultimately the decision of the court (judge or jury) whether the evidence will be admissible or not.

Judges decide what evidence will or will not be allowed in their courtrooms. How judges make this decision, whether according to the Frye or Daubert standards, stochastically, or in some other way, is a separate question.



Today, very few authors deal with the problem of the acceptability of digital evidence [17] [18].



CONCLUSION AND FURTHER RESEARCH



The main problem with DEMF is one typical of any case management system: interoperability and data exchange with other systems.

There are no adopted standards or procedures for the exchange and interoperability of information and data between different agencies and countries. Every agency that wants to "read" data from ".demf" containers, for example, must have the secret key and an installed DEMF application, or at least a DEMF reader.

The author of [19] summarizes the strengths and weaknesses of existing digital evidence schemas and proposes the open-source CybOX schema as a foundation for storing and sharing digital forensic information. An additional open-source schema and associated ontology, called Digital Forensic Analysis eXpression (DFAX), is also proposed; it provides a layer of domain-specific information overlaid on CybOX.

In future research, the authors will address this problem, present DEMF through CybOX and DFAX, and try to fill this gap by representing the metadata in the ".demf" file using the proposed DFAX standard.

Abstract — The field of graph indexing and query processing has received a lot of attention due to the constantly increasing use of graph data structures for representing data in different application domains. Supporting efficient querying is a key issue in all graph-based applications. In this paper, we propose an index trie structure (GCTrie) that is constructed from our proposed graph representative structure, called graph code. GCTrie can support all types of graph query. In this paper, we focus on index construction and on subgraph and supergraph query processing. The experimental results and comparisons support the proposed approach.

Keywords - graph indexing and querying; graph representative structure; index; subgraph query; supergraph query

I. Introduction 

Many scientific and commercial applications call for patterns that are more complex to process than frequent item sets and sequential patterns. Such patterns range from sets and sequences to trees, lattices and graphs. As one of the most general forms of data representation, graphs easily represent entities, their attributes and their relationships to other entities. The significance of using graphs to represent complex datasets has been recognized in disciplines such as chemistry [6], computer vision [7], and image and object retrieval [8]. Various conferences on graph mining over the past few years have motivated researchers to focus on the importance of mining graph data. Different applications produce different kinds of graphs, and the corresponding challenges also differ. A graph describes relationships over a set of entities. With node and edge labels, a graph can depict the attributes of both the entity set and the relation. For example, chemical data graphs are relatively small, but the labels on different nodes (which are drawn from a limited set of elements) may be repeated many times in a single molecule (graph).

Storing graphs in large datasets is a challenging task, as it requires efficient management of space and time. Over the years, a number of representative structures have been developed to represent graphs ever more efficiently and uniquely. Developing such structures is particularly challenging in terms of storage space and generation time. Among the many representative structures, the adjacency list [9] and the adjacency matrix [10] are the most common. We have already proposed a new graph representative structure called graph code [1]. Graph code is a new way of representing graphs that supports all kinds of graph queries without verification between graph structures. A good graph indexing and querying approach should have compact indexing structures and strong power to prune false graphs from the dataset. The strategy of graph indexing is to move costly online query processing to the off-line index construction phase [2]. Chemical graphs in datasets are undirected labelled graphs, so graph code is designed to process undirected labelled graphs. Graph code retains the structural information of the original graph, such as which two edges are connected at which vertex.

To understand and use any collection of graphs effectively, an approach that efficiently supports elementary querying mechanisms is required. Given a query graph, retrieving the related graphs from a large graph dataset is a key issue in all graph-based applications. This has raised a need for efficient graph indexing and querying approaches. A primary challenge in computing the answers to graph queries is that pair-wise comparisons of graphs are usually computationally hard. The success of any graph-based application depends directly on the efficiency of its graph indexing and query processing mechanisms. Recently, many techniques have been proposed to tackle these problems.

In principle, queries on graph datasets can be broadly classified into four categories: graph isomorphism query, subgraph query, supergraph query, and similarity query. Most existing graph indexing and querying approaches deal with only one type of query. Our proposed approach allows a chemical compound dataset to be queried with chemical structures supplied in XML file format, and all types of graph query can be processed. After entering a chemical structure as a query, users can run their desired query type. In this paper, we describe our proposed graph code structure and GCTrie, and we process subgraph and supergraph queries by probing GCTrie. We also present an experimental analysis of index construction and of these queries using the proposed approach and other existing approaches.

II. Preliminaries 

For simplicity, we present the key concepts, notations, and 
terminology used in our proposed approach which includes 
labeled undirected graph, graph automorphism, subgraph 
query, supergraph query, and graph code. 

As a general data structure, labeled graphs are used to model complicated structures and schemaless data. In a labeled graph, vertices and edges represent entities and relationships, respectively. The attributes associated with entities and relationships are called labels. XML is a kind of directed labeled graph. The chemical compound shown in Fig. 1 is a labeled undirected graph.



Figure 1. A Labeled Undirected Graph

Definition 1. Labeled Undirected Graph

A labeled undirected graph $G$ is defined as a 5-tuple $(V, E, L_V, L_E, l)$, where $V$ is a non-empty finite set of vertices, $E$ is a set of unordered pairs of vertices called edges, $L_V$ and $L_E$ are the sets of vertex labels and edge labels, and $l$ is a labelling function that assigns a label to each vertex, $l: V \to L_V$, and to each edge, $l: E \to L_E$.

Definition 2. Graph Isomorphism

Let $G = (V, E, L_V, L_E, l)$ and $G' = (V', E', L'_V, L'_E, l')$ be two graphs. An isomorphism mapping is a mapping of the vertices of $G$ to the vertices of $G'$ that preserves the edge structure of the graphs. An automorphism is an isomorphism mapping where $G = G'$; that is, it is a graph isomorphism from a graph $G$ to itself.

Definition 3. Subgraph Query

This category searches for a specific pattern in the graph dataset. The pattern can be a small graph. Given a graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ and a subgraph query $q$, the answer set is $A = \{G_i \mid q \subseteq G_i, G_i \in D\}$.

Definition 4. Supergraph Query

Given a graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ and a supergraph query $q$, the answer set consists of the dataset graphs of which $q$ is a supergraph: $A = \{G_i \mid G_i \subseteq q, G_i \in D\}$.

Definition 5. Graph Code

For a graph $G_i$, the code of $G_i$, denoted by $c(G_i)$, is a list of entries of the form $e_{id}[(v), e_{id\_adj}], \ldots$, one for each pair of adjacent edges. Here $e_{id}$ is the edge id, $v$ is the label of the vertex at which the two edges are connected, and $e_{id\_adj}$ is the id of the adjacent edge.

III. Related Works 

Graphs are used to represent many real-life applications. They can represent networks such as the paths in a city, a telephone network or a circuit network, and they are also used in social networks such as LinkedIn and Facebook. Many graph datasets (e.g., chemical compounds) have more than one vertex with the same label. The same graph may be stored more than once in a graph dataset, which adversely affects mining results. To ensure the consistency of graph datasets, a mechanism is required to check whether two graphs are automorphic, so that automorphic graphs can be detected and eliminated. In the proposed approach, a graph is represented by its graph code, generated from adjacent edge information and the edge dictionary. Instead of an expensive graph automorphism test, automorphic graphs can be detected by matching the codes of two graphs [3].

GraphGrep [4] is a path-based technique for indexing graph datasets. It has three basic components: building the index that represents graphs as sets of paths, filtering the dataset based on the query, and computing exact matches. GraphGrep enumerates paths up to a threshold length ($l_p$) from each graph. An index table is constructed in which each entry is the number of occurrences of a path in a graph. The filtering phase generates a set of candidate graphs for which the count of each path is at least that of the query. The verification phase verifies each candidate graph by subgraph matching. However, a graph dataset contains a huge number of paths, which can affect the performance of the index.

OrientDB [5] is an open source NoSQL database
management system written in Java. It is a multi-model
database, supporting graph, document, key/value, and object 
models, but the relationships are managed as in graph 
databases with direct connections between records. It supports 
schema-less, schema-full and schema-mixed modes. It has a 
strong security profiling system based on users and roles and 
supports querying with SQL extended for graph traversal. 

IV. Proposed Approach 

Our proposed approach has three main phases: the code generation phase, the subgraph query and supergraph query processing phase, and the graph isomorphism query and similarity query processing phase. The code generation phase has three sub-steps: preprocessing, code generation with automorphism checking, and index construction. The subgraph query and supergraph query processing phase has four sub-steps: preprocessing, code generation, subgraph querying and supergraph querying. The graph isomorphism query and similarity query processing phase also has four sub-steps: preprocessing, code generation, graph isomorphism querying and similarity querying. In this paper, we focus on the index construction step and on the subgraph query and supergraph query processing phase.

A. Preprocessing, Code Generation and Automorphism Checking

In preprocessing, graph information such as vertex information, edge information and adjacent edge information is generated by parsing the input XML files with an XML parser. The edge information of the graph is defined as $(V_{id}, L, V_{id})$, where $V_{id}$ is a vertex id and $L$ is the edge label. Then the adjacent edge information is generated. Fig. 2 shows the graph information for graph $G_1$ in Fig. 1.

Vertex id:    1  2  3  4  5
Vertex Info:  C  C  C  C  O

(a)

Edge (V_id, L, V_id):  <1,s,2>  <1,s,3>  <2,s,4>  <3,s,5>  <4,s,5>
Edge Info:             <C,s,C>  <C,s,C>  <C,s,C>  <C,s,O>  <C,s,O>

(b)

Edge:      Adjacent Edges:
<1,s,2>    <1,s,3>, <2,s,4>
<1,s,3>    <1,s,2>, <3,s,5>
<2,s,4>    <1,s,2>, <4,s,5>
<3,s,5>    <1,s,3>, <4,s,5>
<4,s,5>    <2,s,4>, <3,s,5>

(c)

Figure 2. Graph Information of $G_1$ (a) Vertex Information (b) Edge
Information (c) Adjacent Edge Information

For each edge in the graph's edge information, the edge dictionary is checked to determine whether the edge already exists in it. If not, the new edge is inserted into the edge dictionary. The edge ids are then associated with their corresponding edges in the graph's edge information. The edge dictionary is shown in Fig. 3.



Id    Edge
1     <C,s,C>
2     <C,s,O>

Figure 3. Edge Dictionary



A graph is represented holistically as a graph code that preserves the structural information of the graph. Every edge in the graph is assigned the globally unique identifier defined for it in the edge dictionary. Using the edge id from the edge dictionary instead of the edge itself has three advantages:

• Firstly, using the edge id in the code reduces the amount
of storage space required.

• Secondly, using the same id for duplicated edges makes
constructing the graph code more efficient.

• Thirdly, using the edge id in the code reduces the
time for finding automorphic or isomorphic graphs.

Most chemical graphs have many edges in common, so the edge dictionary uses little memory. The edge dictionary and the adjacent edge information are used to generate the graph code. The graph code for graph $G_1$ is as follows:

$c(G_1)$ = 1[c,1],1[c,1],1[c,1],1[c,2],1[c,1],1[c,2],2[c,1],2[o,2],2[c,1],2[o,2]
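
The code generation step can be illustrated with a short sketch that reproduces the code of the graph in Fig. 1 from its vertex labels and edge list. It is an illustration under stated assumptions, not the authors' implementation: the names `build_graph_code` and `format_code` are hypothetical, edge ids are assumed to be assigned in order of first appearance, and endpoint labels are sorted so that an undirected edge has a single dictionary entry.

```python
def build_graph_code(vertex_labels, edges, edge_dictionary):
    """Return the graph code as a list of (edge_id, vertex_label, adjacent_edge_id).

    vertex_labels: {vertex_id: label}; edges: list of (u, edge_label, v).
    edge_dictionary: shared dict mapping a labelled edge to its id (updated in place).
    """
    def edge_id(edge):
        u, label, v = edge
        # Assumption: sort endpoint labels so <C,s,O> and <O,s,C> share one entry.
        a, b = sorted((vertex_labels[u], vertex_labels[v]))
        key = (a, label, b)
        if key not in edge_dictionary:
            edge_dictionary[key] = len(edge_dictionary) + 1
        return edge_dictionary[key]

    ids = [edge_id(e) for e in edges]
    code = []
    for i, (u1, _, v1) in enumerate(edges):
        for j, (u2, _, v2) in enumerate(edges):
            if i == j:
                continue
            # One feature per vertex shared by the two edges.
            for shared in sorted({u1, v1} & {u2, v2}):
                code.append((ids[i], vertex_labels[shared].lower(), ids[j]))
    return code


def format_code(code):
    """Render features in the e_id[v,e_id_adj] notation used in the text."""
    return ",".join(f"{e}[{v},{a}]" for e, v, a in code)


# The graph of Fig. 1, using the vertex and edge information of Fig. 2.
vertex_labels = {1: "C", 2: "C", 3: "C", 4: "C", 5: "O"}
edges = [(1, "s", 2), (1, "s", 3), (2, "s", 4), (3, "s", 5), (4, "s", 5)]
edge_dictionary = {}
print(format_code(build_graph_code(vertex_labels, edges, edge_dictionary)))
# prints 1[c,1],1[c,1],1[c,1],1[c,2],1[c,1],1[c,2],2[c,1],2[o,2],2[c,1],2[o,2]
```

The quadratic scan over edge pairs keeps the sketch short; the adjacent edge information produced during preprocessing would avoid it.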

After the graph code of $G_k$ is computed, it is compared with each graph code of $G_i$ in the code store (CS), $1 \le i < k$, to check for graph automorphism. If the graph code of $G_k$ is the same as that of $G_i$, the two graphs are concluded to be automorphic and the id of $G_k$ is appended to the corresponding graph code of $G_i$. Otherwise, the graph code of $G_k$ is added to CS as a new graph.

B. Index Construction 

After generating the graph codes for all dataset graphs and checking automorphism, the next step is to construct GCTrie for efficient querying. Path or subgraph decomposition, commonly used to support the subgraph query type, loses structural information and incurs exhaustive enumeration time; instead, we propose an index trie structure called GCTrie that supports all types of graph query. We put the graph codes of all dataset graphs into GCTrie. A GCTrie is a trie in which each node except the root is a string array that represents an edge id or a vertex label at which two edges are connected. There are five levels in the GCTrie. The first level is the root. The second and fourth levels hold edge ids. The third level holds vertex labels, and the last level holds the leaves, which are implemented as hashmaps of graph ids and their frequencies. The procedure for index construction is shown below. We refer to one edge's adjacent code, e.g., 1[o,2], as a feature f.



Procedure. IndexConstruction(CS)

For each c(G_i) ∈ CS
    For each feature f ∈ c(G_i)
        Put f in GCTrie
return GCTrie
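
A minimal sketch of this procedure is given below. It models the five levels with nested dictionaries (root, edge id, vertex label, adjacent edge id, and a leaf map from graph id to frequency) rather than the string arrays described in the text, so it should be read as an approximation; the names `new_gctrie`, `put` and `index_construction` are hypothetical.

```python
from collections import defaultdict


def new_gctrie():
    # root -> edge id -> vertex label -> adjacent edge id -> {graph id: frequency}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(trie, feature, graph_id):
    """Insert one feature (edge_id, vertex_label, adjacent_edge_id) for a graph."""
    edge_id, vertex_label, adjacent_id = feature
    leaf = trie[edge_id][vertex_label][adjacent_id]
    leaf[graph_id] = leaf.get(graph_id, 0) + 1


def index_construction(code_store):
    """code_store: {graph_id: list of features}. Returns the populated trie."""
    trie = new_gctrie()
    for graph_id, code in code_store.items():
        for feature in code:
            put(trie, feature, graph_id)
    return trie


# Graph code of the graph in Fig. 1, written as (edge id, vertex label, adjacent edge id).
code_store = {
    "G1": [(1, "c", 1), (1, "c", 1), (1, "c", 1), (1, "c", 2), (1, "c", 1),
           (1, "c", 2), (2, "c", 1), (2, "o", 2), (2, "c", 1), (2, "o", 2)],
}
trie = index_construction(code_store)
print(trie[1]["c"][1])  # {'G1': 4}
```

Because repeated features only increment a counter in the leaf, the trie for this graph holds four distinct paths even though the code has ten entries.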



During GCTrie construction, for the graph code of $G_1$ from CS, $c(G_1)$ = 1[c,1],1[c,1],1[c,1],1[c,2],1[c,1],1[c,2],2[c,1],2[o,2],2[c,1],2[o,2], there are four occurrences of the feature 1[c,1] and two occurrences each of the features 1[c,2], 2[c,1] and 2[o,2]. The GCTrie after inserting $c(G_1)$ is shown in Fig. 4.




Figure 4. GCTrie After Putting $c(G_1)$

Then we put all dataset graph codes from CS into this GCTrie. When a query graph enters the system, its vertex information, edge information and adjacent edge information are generated in the preprocessing step. The query graph code is then generated in the code generation phase using the system's edge dictionary and the query's adjacent edge information. Each feature of the query graph code is probed in GCTrie.

C. Subgraph Querying 

At the core of many graph-related applications lies a common and critical problem: how to process subgraph queries efficiently. In some cases, the success of an application relies directly on the efficiency of the query processing system. The classical graph query problem can be described as follows: given a graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ and a graph query $q$, find all the graphs of which $q$ is a subgraph. If all of the query features match features of a dataset graph's code in the GCTrie, and the dataset graph is at least as large as the query graph, then that dataset graph is returned in the answer set. Algorithm 1 describes the step-by-step process for a subgraph query.

Algorithm 1 SubgraphQuery

Input: GCTrie, and Query q
Output: Answer set D_q.

1. Generate graph code for query q.
2. Let D_q = D
3. For each feature qf ∈ c(q)
4.     Probe qf in GCTrie.
5.     If qf ∈ GCTrie
6.         Intersect D_q and D_qf.
7. For each G_i ∈ D_q
8.     If size(G_i) < size(q)
9.         Remove G_i from D_q.
10. Return D_q;
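
The steps of Algorithm 1 can be sketched as follows. The sketch makes several assumptions that are not in the text: the GCTrie is replaced by a flat dictionary from feature to `{graph id: frequency}`, graph size is taken to be the number of edges, and a query feature that is absent from the index is assumed to rule out every dataset graph (Algorithm 1 leaves this case open). The function name `subgraph_query` and the second graph `G2` are hypothetical.

```python
def subgraph_query(index, sizes, query_code, query_size):
    """Return ids of dataset graphs whose codes contain every query feature.

    index: {feature: {graph_id: frequency}}, a flattened stand-in for the GCTrie.
    sizes: {graph_id: number of edges}.
    """
    candidates = set(sizes)                  # step 2: D_q = D
    for feature in query_code:               # steps 3 to 6
        postings = index.get(feature)
        if postings is None:
            return set()                     # assumption: missing feature, no answer
        candidates &= set(postings)
    # steps 7 to 9: a graph smaller than the query cannot contain it
    return {g for g in candidates if sizes[g] >= query_size}


# G1 is the graph of Fig. 1; G2 is a hypothetical three-carbon path C-C-C.
index = {
    (1, "c", 1): {"G1": 4, "G2": 2},
    (1, "c", 2): {"G1": 2},
    (2, "c", 1): {"G1": 2},
    (2, "o", 2): {"G1": 2},
}
sizes = {"G1": 5, "G2": 2}

# Query: the path C-C-O, whose code is 1[c,2],2[c,1].
print(subgraph_query(index, sizes, [(1, "c", 2), (2, "c", 1)], 2))  # {'G1'}
```

Feature containment is the necessary condition on which the approach rests, so the sketch returns exactly the graphs that satisfy that condition and the size check.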



Assume that we have generated the graph code of the query graph. We establish a necessary condition that forms the basis for processing subgraph queries, stated in the following theorem.

Theorem 1 Given a query graph $q$, if $q$ is a subgraph of a dataset graph $G$, then $c(q) \subseteq c(G)$.

Proof. By definition, if $q$ is a subgraph of $G$, then every feature of $q$ appears in $G$. Therefore, all features of $c(q)$ are contained in $c(G)$, and $c(q) \subseteq c(G)$.

The intuition is as follows. If a query $q$ is a subgraph of a dataset graph, then its features are a subset of the features of the dataset graph. Therefore, the adjacent edges of each edge that appear in the graph code of the query will also appear in the graph code of the dataset graph.

D. Supergraph Querying 

A supergraph query searches for the dataset graphs all of whose features are contained in the input query. Formally, given a dataset $D = \{G_1, G_2, \ldots, G_n\}$ and a supergraph query $q$, if $q$ is a supergraph of some dataset graphs, then its features form a superset of the features of the resulting dataset graphs. The large number of graphs in datasets and the NP-completeness of subgraph isomorphism testing make it challenging to process supergraph queries efficiently.

Vol. 13, No. 12, December 2015

In our proposed approach, when the query graph enters, it is represented as a query graph code. Each feature of the query graph code is probed in GCTrie. If all of the query features are matched with features of data graph codes in the GCTrie and the query graph is at least as large as a dataset graph, then that dataset graph is returned in the answer set as a graph contained in the query as a subgraph. Algorithm 2 describes the step-by-step process for a supergraph query.

Algorithm 2 SupergraphQuery

Input: GCTrie, and Query q
Output: Answer set D_q.

1. Generate graph code for query q.
2. Let D_q = D
3. For each feature qf ∈ c(q)
4.     Probe qf in GCTrie.
5.     If qf ∈ GCTrie
6.         Intersect D_q and D_qf.
7. For each G_i ∈ D_q
8.     If size(G_i) > size(q)
9.         Remove G_i from D_q.
10. Return D_q;
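
The sketch below illustrates the containment idea behind supergraph querying. Note that it does not follow the intersection step of Algorithm 2 literally: it keeps a dataset graph when every distinct feature of that graph occurs in the query code, which is the condition described in the text for supergraph queries, and then applies the size check of steps 7 to 9. As before, the flat `index` dictionary stands in for the GCTrie, size is assumed to be the number of edges, and the names `supergraph_query` and `G2` are hypothetical.

```python
def supergraph_query(index, sizes, query_code, query_size):
    """Return ids of dataset graphs whose every feature occurs in the query code.

    index: {feature: {graph_id: frequency}}, a flattened stand-in for the GCTrie.
    sizes: {graph_id: number of edges}.
    """
    query_features = set(query_code)
    total = {}    # distinct features per graph
    covered = {}  # distinct features per graph that the query also has
    for feature, postings in index.items():
        for graph_id in postings:
            total[graph_id] = total.get(graph_id, 0) + 1
            if feature in query_features:
                covered[graph_id] = covered.get(graph_id, 0) + 1
    # Keep graphs fully covered by the query and no larger than it.
    return {g for g in total
            if covered.get(g, 0) == total[g] and sizes[g] <= query_size}


# G1 is the graph of Fig. 1; G2 is a hypothetical three-carbon path C-C-C.
index = {
    (1, "c", 1): {"G1": 4, "G2": 2},
    (1, "c", 2): {"G1": 2},
    (2, "c", 1): {"G1": 2},
    (2, "o", 2): {"G1": 2},
}
sizes = {"G1": 5, "G2": 2}

# Query: the graph of Fig. 1 itself, which contains both dataset graphs.
query = [(1, "c", 1), (1, "c", 2), (2, "c", 1), (2, "o", 2)]
print(sorted(supergraph_query(index, sizes, query, 5)))  # ['G1', 'G2']
```

Scanning the whole index is acceptable for a sketch; a trie-based implementation would probe only the query's features and track per-graph coverage in the leaves.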



Assume that we have generated the graph code of the query graph. We establish a necessary condition that forms the basis for processing supergraph queries, stated in the following theorem.

Theorem 2 Given a query graph $q$, if $q$ is a supergraph of a dataset graph $G$, then $c(G) \subseteq c(q)$.

Proof. By definition, if $q$ is a supergraph of $G$, then every feature of $G$ appears in $q$. Therefore, all features of $c(G)$ are contained in $c(q)$, and $c(G) \subseteq c(q)$.

The intuition is as follows. If a query $q$ is a supergraph of a dataset graph, then its features form a superset of the features of the resulting dataset graphs. Therefore, the adjacent edges of each edge that appear in the graph code of the dataset graph will also appear in the graph code of the query.

V. Experimental Analysis 

This section presents a performance analysis of the proposed approach. The main goal of the experiments is to evaluate the performance of our approach on the AIDS antiviral screen compound dataset, the NCI yeast anticancer drug screen dataset, and the primary screening dataset for Formylpeptide Receptor.

Index construction times are analyzed for the graph indexing approaches GraphGrep, OrientDB and our proposed approach. All of the approaches are implemented in Java on an Intel(R) Core(TM) i3-4010U CPU with 2GB memory and the Windows 7 34-bit operating system. Fig. 5, Fig. 6, and Fig. 7 show the index construction times of the different approaches on the three datasets, respectively. Various numbers of chemical graphs are tested, and the average index construction times are used to compare GraphGrep, OrientDB and our proposed approach. For GraphGrep, we use two values, 4 and 10, for the path length parameter ($l_p$). It can be seen that index construction with our proposed approach takes at least 10 times less time than OrientDB and at least $10^2$ times less than GraphGrep ($l_p = 4$) and GraphGrep ($l_p = 10$).




Figure 5. Analysis of Index Construction Time of Three Different 
Approaches for AIDS Antiviral Screen Dataset 




Figure 6. Analysis of Index Construction Time of Three Different 
Approaches for NCI Yeast Anti-cancer Drug Screen Dataset 




Figure 7. Analysis of Index Construction Time of Three Different 
Approaches for Primary Screening Dataset for Formylpeptide Receptor 

We compare the query response time of our proposed approach with that of GraphGrep on the AIDS antiviral screen dataset, the NCI yeast anti-cancer drug screen dataset and the primary screening dataset for Formylpeptide Receptor. Since GraphGrep supports only subgraph isomorphism queries, we evaluate subgraph isomorphism query response time against it. For GraphGrep, we use two values, 4 and 10, for the path length parameter ($l_p$). Fig. 8 shows the analysis of subgraph isomorphism query response time on the AIDS antiviral screen dataset. It can be seen that our proposed approach reduces subgraph isomorphism query response time by a factor of at least $10^3$ compared to GraphGrep.




Figure 8. Analysis of Subgraph Query Response Time Between GraphGrep 
and Proposed Approach on AIDS antiviral screen Dataset 



Fig. 9 shows the analysis of subgraph isomorphism query response time on the NCI yeast anti-cancer drug screen dataset. Fig. 10 shows the same analysis on the primary screening dataset for Formylpeptide Receptor. It can be seen that, compared to GraphGrep, our proposed approach reduces subgraph isomorphism query response time by a factor of at least $10^2$ on the NCI yeast anti-cancer drug screen dataset and at least $10^3$ on the primary screening dataset for Formylpeptide Receptor.




Figure 9. Analysis of Subgraph Query Response Time Between GraphGrep 
and Proposed Approach on NCI Yeast Anti-cancer Drug Screen Dataset 

Figure 10. Analysis of Subgraph Query Response Time Between GraphGrep 
and Proposed Approach on Primary Screening Dataset for Formylpeptide 
Receptor 

VI. CONCLUSION AND ONGOING WORK 

The proposed graph code uses an edge dictionary and 
adjacent-edge information to preserve the structural 
information of the original graph. It can be used to detect 
automorphic graphs efficiently, without expensive pair-wise 
comparisons. Instead of a path or subgraph decomposition 
process, which can lose structural information and requires 
exhaustive enumeration, GCTrie is used to support all query 
types. In our experiments, the proposed approach outperforms 
the existing methods in index construction time and subgraph 
query response time. Similarity query processing is the 
subject of our ongoing work. 

Abstract — The cloud is one of the most important shared 
environments, to which many customers connect in order to 
access services and products. In this environment all 
resources are available in an integrated form and several 
users can send their requests simultaneously. Under these 
conditions, methods are required to schedule requests 
efficiently. An important case arises when a cloud server is 
overloaded or a user request has a deadline constraint. In 
these situations, migration is used to provide efficient service. 
Migration of virtual machines can be used as a tool for 
achieving balance. In other words, when the load of the virtual 
machines on a physical machine increases, some of the virtual 
machines migrate to another physical machine so that the 
virtual machines do not face any shortage of capacity. Virtual 
machine migration is used for many reasons, such as reducing 
power consumption, load balancing among virtual machines, 
online maintenance and the ability to respond to requests with 
deadline constraints. 

In this thesis, we present an efficient algorithm for scheduling live 
migration of virtual machines that run applications with 
deadline constraints. The scheduling uses thresholds for 
migrating virtual machines in order to guarantee the service level 
agreement (SLA). Simulations show that the proposed method 
performs better in the number of successfully served requests 
and in the level of energy consumption. 

Keywords- Cloud computing; Virtual machine migration; 
Threshold; Cloudsim; SLA; Scheduling 

I. Introduction 

With new and innovative ideas for internet services, 
software developers no longer need to spend much of their 
investment on preparing hardware and hiring people to 
launch their services. They do not need to worry about 
over-provisioning unpopular services or under-provisioning 
popular ones. This is considered a major advantage for 
information technology corporations, because instead of 
preparing servers and software infrastructure they can focus 
on innovation and on creating business value. 

Cloud computing is a form of large-scale distributed 
computing, driven by economies of scale, that is managed as a 
pool of computing and storage power, platforms and virtual 
services. Cloud computing offers dynamic scalability and is 
delivered on demand to remote customers over the internet. 

Cloud computing provides infrastructure, platform and 
software to customers as subscription-based services that are 
available on demand. These services are called infrastructure as 
a service, platform as a service and software as a service, 
respectively. In other words, cloud computing is a model for 
supporting everything as a service. 

Virtualization technologies change the way a datacenter 
consumes server resources. Instead of dedicating particular 
servers to each type of application, virtualization makes it 
possible to treat resources as one integrated pool. Complexity 
is therefore reduced and management becomes much easier. 
With virtualization we can hide operational details from users. 
One of the most important virtualization techniques is virtual 
machine migration, i.e., transferring the allocated process to 
another virtual machine or cloud in order to relieve load and 
respond to cloud requests effectively. 

In fact, virtualization in the cloud makes it possible to 
separate applications and services from the physical 
hardware. With virtualization, several virtual machines run 
concurrently on a single physical machine and share its 
capacity (processing, memory and network). The cloud 
provider should honor the service level agreement and avoid 
violating it. An agreement at this level can include items such 
as response time, job loss rate, energy consumption level, 
deadline and number of migrations. In practice we must 
achieve a tradeoff between SLA parameters. It should be noted 
that, because the cloud is very large and the load of virtual 
machines is not predictable, SLA violations cannot be avoided 
completely. 

The SLA that we focus on in this thesis combines meeting 
deadline constraints with reducing energy consumption. There 
are several studies of virtual machine migration. 

II. Related works 

The main idea of the live migration algorithm was proposed by 
Clark et al. [1] in 2005. First, the hypervisor marks all pages 
as dirty; then the algorithm iteratively transfers dirty pages 
over the network until the number of pages remaining to be 
transferred falls below a threshold or the maximum number of 
iterations is reached. The hypervisor then marks the 
transferred pages as clean. 

In 2006, Khanna et al. [2] studied server consolidation. 
Their method defines thresholds on resource utilization to 
avoid application performance degradation. It does not take 
the network topology into account in the details of the 
migration algorithm. 

In the Bobraff method of 2007 [3], the number of requests is 
calculated at regular intervals. Virtual machines are then 
mapped onto physical machines. 

In 2007, Wood et al. [4] used virtual machine migration to 
balance resource demand across physical hosts, but 
minimizing the cost of servers was not their goal. This method 
uses gray-box and black-box approaches to reduce critical 
zones (hotspots) in virtual clusters by exploiting the 
capabilities of virtual machine live migration. 

Bradford et al. [5] in 2007 proposed a system that supports 
transparent live migration of virtual machines that use local 
storage for their persistent state. The method is transparent 
to the migrated virtual machines and does not interrupt open 
network connections, which guarantees the consistency of the 
virtual machine at source and destination after migration. 

In 2008, Tal Maoz et al. [6] presented an approach for 
migrating virtual machines and processes in grid and cluster 
environments. Requirements such as high service availability, 
improved resource utilization and manageability increase the 
need for virtual machine migration. 

In 2009, Van et al. [7] proposed an architecture for 
managing virtual machines in clouds based on migration 
scheduling. This method has a long execution time. 

In 2010, Jing Thai Piao et al. [8] presented a virtual machine 
placement and migration approach that minimizes data 
transfer time. Simulation results show that the proposed 
method is effective in optimizing data transfer between virtual 
machines, which helps to optimize overall application 
performance. 

In 2011, RicaSinha et al. [9] presented an energy-aware 
algorithm for scheduling live migration that uses a dynamic 
threshold. In this method the dynamic threshold is used to 
increase processor efficiency in the datacenter. 

In 2011, Aziz Murtazaev et al. [10] proposed a method that 
first sorts the nodes by virtual machine load in decreasing 
order. The virtual machines on the last node of the list (the 
node with the lowest load) become candidates for migration 
and are themselves sorted by weight in decreasing order. The 
virtual machines are then allocated one by one to the first 
node (the node with the highest load); if that attempt fails, the 
second node is tried, and so on. Because the node with the 
lowest load usually hosts the fewest virtual machines, the 
number of migrations decreases. 

In 2012, Ahmed Abba et al. [11] proposed a deadline-based 
method for scheduling jobs in a grid environment. The method 
does not consider migration. The algorithm chooses the job 
with the nearest deadline for execution and uses a dynamic 
time quantum. Jobs are sorted by their delay in increasing 
order, and the job with the minimum delay becomes the 
candidate for execution. If two jobs have equal delay, the 
candidate is chosen by arrival time. The algorithm is compared 
with the Round Robin and FCFS scheduling algorithms on 
average execution time, average waiting time and maximum 
delay, and the results show an improvement in these 
parameters. 

In 2013 [12], a method was proposed for energy-efficient 
virtual machine scheduling in cloud datacenters. Results show 
that this algorithm reduces only energy consumption. 

Quentin Perret et al. [13] in 2013 proposed a deadline-based 
scheduler for distributed systems that addresses data 
placement management. The algorithm is called cloud least 
laxity first and is compared with the time-shared and 
space-shared scheduling algorithms. In this method the laxity 
of each job is defined as the difference between its deadline 
and the remaining time needed to complete the job. The 
scheduling is non-preemptive: multiprocessor algorithms 
usually assume that a job stopped on one processor can 
continue on another at no cost, but here information is stored 
locally, so a job transferred to another node must execute 
from the beginning. 
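
The laxity rule described above can be sketched as follows. This is a minimal illustration and not the scheduler of [13]: the function names and the tuple layout are hypothetical, and the deadline is assumed to be given as the time still left until the deadline.

```python
def laxity(time_to_deadline, remaining_time):
    """Slack of a job: time left to the deadline minus work still to do."""
    return time_to_deadline - remaining_time


def pick_least_laxity(jobs):
    """Return the job with the smallest laxity (the most urgent one).

    jobs -- iterable of (name, time_to_deadline, remaining_time) tuples;
            this layout is an assumption made for the example.
    """
    return min(jobs, key=lambda job: laxity(job[1], job[2]))


# Hypothetical jobs: laxities are 15, 2 and 9, so "b" is served first.
jobs = [("a", 20, 5), ("b", 10, 8), ("c", 12, 3)]
print(pick_least_laxity(jobs)[0])
```

A job with small laxity has little slack before it misses its deadline, which is why it is scheduled first.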

In 2013, Baghshahi et al. [14] proposed a virtual machine 
migration method based on intelligent agents. In that paper, 
multiple virtual machines are treated as a cluster, and these 
clusters are migrated from one data center to another using 
weighted fair queuing. 

III. Virtual machine live migration 

Live virtual machine migration is a technique that migrates 
the entire OS and its associated applications from one physical 
machine to another. The virtual machines are migrated live, 
without disrupting the applications running on them. The 
benefits of virtual machine migration include conserving 
physical server energy, load balancing among the physical 
servers and fault tolerance in case of sudden failure. Live 
migration is an extremely powerful tool for cluster and cloud 
administrators. An administrator can migrate OS instances 
together with their applications so that a machine can be freed 
for maintenance. Similarly, to improve manageability, OS 
instances may be rearranged across machines to relieve the 
load on overloaded hosts. To perform the live migration of a 
virtual machine, its runtime state must be transferred from the 
source to the destination while the virtual machine is still 
running. 

There are two major methods: post-copy and pre-copy 
memory migration. The post-copy method first suspends the 
migrating virtual machine at the source node, then copies 
minimal processor state to the target host, resumes the 
virtual machine there, and begins fetching memory pages over 
the network from the source node. 

There are two phases in the pre-copy method: the warm-up 
phase and the stop-and-copy phase. In the warm-up phase, the 
hypervisor copies all the memory pages from source to 
destination while the virtual machine is still running on the 
source. If some memory pages change during the copy, they 
are re-copied until the rate of re-copied pages is not less than 
the page dirtying rate. In the stop-and-copy phase, the VM is 
stopped at the source, the remaining dirty pages are copied to 
the destination, and the virtual machine is resumed at the 
destination [15]. 
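
The two pre-copy phases can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a hypervisor's actual algorithm: the function name, the constant dirtying fraction, the stop threshold and the round limit are all hypothetical.

```python
def simulate_pre_copy(total_pages, pages_per_round, dirty_rate,
                      stop_threshold, max_rounds):
    """Toy model of pre-copy migration (all parameters are hypothetical).

    total_pages     -- memory pages of the VM
    pages_per_round -- pages the network can copy in one warm-up round
    dirty_rate      -- fraction of the pages copied in a round that the
                       still-running VM dirties again during that round
    Returns (warm_up_rounds, pages_left_for_stop_and_copy).
    """
    to_copy = total_pages  # warm-up starts with every page still to send
    rounds = 0
    while to_copy > stop_threshold and rounds < max_rounds:
        copied = min(to_copy, pages_per_round)
        # Pages not yet sent, plus pages dirtied again while copying.
        to_copy = (to_copy - copied) + int(copied * dirty_rate)
        rounds += 1
    # Stop-and-copy: the VM is paused and the remainder is transferred.
    return rounds, to_copy


print(simulate_pre_copy(1000, 1000, 0.5, 20, 10))
```

If the VM dirties pages as fast as they are copied (`dirty_rate` of 1.0 in this model), the warm-up never converges and only the round limit ends it, which is why practical implementations bound the number of iterations.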

IV. Proposed method 

In this method we use pre-copy live virtual machine 
migration. First we define the number of virtual machines, 
physical machines and applications (jobs). We assume that 
each virtual machine can respond to only one request (task) at 
a time, and other requests are placed in a queue. Resources 
(CPU, network, memory, I/O) are allocated to each virtual 
machine in order to handle the workload and ensure the SLA. 
We also assume that the placement of applications is known 
from the beginning and resources are allocated to them. Based 
on the agreement between the customer and the cloud service 
provider, each job has a specific deadline for execution, and if 
it is not executed within that time an SLA violation occurs. To 
prevent SLA violations, the virtual machines that execute the 
job must migrate. 

The proposed method consists of four levels: 

• diagnose suitable time for virtual machine migration 

• select suitable virtual machine for migration 

• define migration time sequence for selected virtual 
machine 

• select suitable destination for migrated virtual machine 

At this level the policy is to find which hosts should migrate 
a virtual machine. Equation (1) gives the waiting time of a 
virtual machine, which is the interval between the creation of 
the virtual machine and the start of job execution. 

$$\mathrm{VmWaitingTime} = \mathrm{JobStartExeTime} - \mathrm{VmCreationTime} \qquad (1)$$

Equation (2) gives the estimated execution time of a virtual 
machine. The job execution time depends on the number of 
instructions the virtual machine can execute per second and on 
the number of its processing elements (processors). 



$$\mathrm{VmExecutionTime} = \frac{\mathrm{JobLength}}{\mathrm{VmMIPS} \times \mathrm{VmNoPEs}} \qquad (2)$$
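
Equations (1) and (2) translate directly into code. The sketch below is only an illustration: the function and argument names are chosen here to mirror the symbols of the equations, and the job length is assumed to be given in million instructions.

```python
def vm_waiting_time(job_start_exe_time, vm_creation_time):
    """Equation (1): interval between VM creation and job start."""
    return job_start_exe_time - vm_creation_time


def vm_execution_time(job_length, vm_mips, vm_no_pes):
    """Equation (2): job length divided by the VM's total capacity.

    job_length is assumed to be in million instructions, vm_mips in
    million instructions per second per processing element.
    """
    return job_length / (vm_mips * vm_no_pes)


# Hypothetical values: a job of 10000 MI on a 500 MIPS VM with 2 PEs.
print(vm_waiting_time(12.0, 2.0))
print(vm_execution_time(10000, 500, 2))
```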

Equation (3) gives the deadline of the requested job. The 
deadline is a function of execution time and can be calculated 
in many ways. The deadline should not be set very low, because 
that leads to more SLA violations, and it should not be set very 
high either, because then no SLA violation occurs at all. 
Therefore the job is executed once on the virtual machine with 
the highest speed (best case) and once on the virtual machine 
with the lowest speed (worst case). Relation (3) assigns a 
weight to each execution time; k = 0.6 gives the intended 
balance. 



A. Diagnose suitable time for virtual machine migration 

Hosts are checked periodically. In addition to keeping 
energy consumption below a specific threshold and balancing 
load, another goal is to find hosts that cannot respond to 
received requests within the defined deadline, i.e., hosts for 
which the sum of the waiting time and execution time of all 
their virtual machines is greater than the sum of the defined 
deadlines. The best time to migrate virtual machines is 
therefore when the consumption of their host is higher than a 
specific threshold or the host has a virtual machine that 
cannot respond to a received request within the defined 
deadline. It is not necessary to migrate all virtual machines of 
the candidate hosts; at each level only one virtual machine 
migrates. The virtual machines of each candidate host are then 
placed in a migration queue. Fig. 1 shows the queue of 
migratable virtual machines in each selected host. 

$$\mathrm{Deadline} = \mathrm{BestCase} + k \times (\mathrm{WorstCase} - \mathrm{BestCase}), \qquad k = 0.1, \ldots, 0.9 \qquad (3)$$
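
As a worked illustration of (3), the sketch below interpolates between the best-case and worst-case execution times. The helper `best_and_worst_case`, the single processing element and the numeric values are assumptions made for the example; only the interpolation formula comes from the text.

```python
def best_and_worst_case(job_length, fastest_mips, slowest_mips):
    """Execution time on the fastest and on the slowest VM.

    Assumes one processing element per VM (hypothetical helper).
    """
    return job_length / fastest_mips, job_length / slowest_mips


def deadline(best_case, worst_case, k=0.6):
    """Equation (3): weighted point between best and worst case."""
    if not 0.1 <= k <= 0.9:
        raise ValueError("k is expected to lie in [0.1, 0.9]")
    return best_case + k * (worst_case - best_case)


# Hypothetical job of 10000 MI on VMs of 2500 MIPS and 500 MIPS.
best, worst = best_and_worst_case(10000, 2500, 500)
print(best, worst, deadline(best, worst))
```

A smaller k pulls the deadline toward the best case and makes violations more likely; a larger k pulls it toward the worst case.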

After the above relations are calculated, a host becomes a 
candidate, i.e. it is treated as a host with high energy 
consumption, if the sum of the execution and waiting times of 
its virtual machines is greater than the sum of their deadlines, 
or if the energy consumption of the host is greater than the 
defined threshold. This avoids SLA violations and the 
degradation of quality of service, which depend on the 
consumption level and the deadline: 



$$A:\ \mathrm{VmExecutionTime} + \mathrm{VmWaitingTime} > \mathrm{Deadline}$$

$$B:\ \mathrm{Utilization} > \mathrm{UtilizationThreshold}$$

$$\text{if } (A \lor B) \text{ then HostSelected} \qquad (4)$$
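
Condition (4) can be sketched as a small predicate over a host's virtual machines. The `Vm` record and the function name are hypothetical, and condition A is evaluated over the sums for all virtual machines of the host, following the description in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vm:
    execution_time: float  # VmExecutionTime, equation (2)
    waiting_time: float    # VmWaitingTime, equation (1)
    deadline: float        # Deadline, equation (3)


def host_selected(vms: List[Vm], utilization: float,
                  utilization_threshold: float) -> bool:
    """Equation (4): select the host if condition A or B holds."""
    total_time = sum(vm.execution_time + vm.waiting_time for vm in vms)
    total_deadline = sum(vm.deadline for vm in vms)
    a = total_time > total_deadline          # deadlines cannot be met
    b = utilization > utilization_threshold  # host is above the threshold
    return a or b


# Hypothetical host: 13 s of work against a 10 s deadline, low utilization.
print(host_selected([Vm(9, 4, 10)], 0.5, 0.8))
```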

Figure 1. Migratable virtual machines queue 



Figure 2. Pseudo code of proposed scenario 

$$\mathrm{RemainLength} = \mathrm{JobLength} - \mathrm{VmMIPS} \times (\mathrm{CurrentTime} - \mathrm{JobStartExeTime}) \qquad (7)$$
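
Equation (7) can be written as a one-line function. In this sketch the clamp to zero is an added assumption (a job cannot have a negative remaining length); it is not part of (7) itself.

```python
def remain_length(job_length, vm_mips, current_time, job_start_exe_time):
    """Equation (7): instructions still to execute, in million instructions."""
    executed = vm_mips * (current_time - job_start_exe_time)
    # Clamp at zero: an assumption of this sketch, not part of (7).
    return max(0.0, job_length - executed)


# Hypothetical job of 10000 MI that has run for 10 s on a 500 MIPS VM.
print(remain_length(10000, 500, 12, 2))
```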



B. Select suitable virtual machine for migration 

At this level, following the proposed scenario, the virtual 
machines of each selected physical host are put in a queue. We 
select a virtual machine whose host cannot respond to its 
request within the defined deadline. For each virtual machine 
of the selected host the following method is executed: 

First, the virtual machine waiting time is calculated by (1). 
Then the virtual machine execution time is calculated by (2). 
Here we consider the remaining execution time, i.e., the total 
execution time minus the time already spent on the requested 
job. Finally, the deadline is calculated by (3). Here we consider 
the remaining time to the deadline, i.e., the total deadline minus 
the time already elapsed. After these calculations, a virtual 
machine whose execution time plus waiting time is greater 
than its deadline is a candidate for migration, in order to avoid 
SLA violation and QoS reduction. 

$$\mathrm{Sum} = \mathrm{VmExecutionTime} + \mathrm{VmWaitingTime}$$

$$\text{if } (\mathrm{Sum} > \mathrm{Deadline}) \rightarrow \text{Migration Required} \qquad (5)$$



At this level the remaining execution time is calculated 
based on (2), and then the remaining time to the deadline is 
calculated based on (3). After that, (8) is checked: 

$$\text{if } (\mathrm{VmRemainExeTime} + \mathrm{VmMigrationTime} < \mathrm{RemainDeadline}) \qquad (8)$$
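
The check in (8) can be sketched together with the two remaining-time quantities it uses. The helper names and the default of one processing element are assumptions for illustration; the remaining execution time reuses the form of (2) and the remaining deadline is the total deadline minus the elapsed time, as described in the text.

```python
def remaining_execution_time(remain_length, vm_mips, vm_no_pes=1):
    """Remaining execution time, using the form of equation (2)."""
    return remain_length / (vm_mips * vm_no_pes)


def remaining_deadline(deadline, elapsed_time):
    """Total deadline minus the time already elapsed."""
    return deadline - elapsed_time


def migration_feasible(vm_remain_exe_time, vm_migration_time,
                       remain_deadline):
    """Equation (8): the job must still finish before its deadline."""
    return vm_remain_exe_time + vm_migration_time < remain_deadline


# Hypothetical case: 5000 MI left on a 500 MIPS VM, 12 s of a 30 s
# deadline already used, 5 s needed for the migration itself.
exe = remaining_execution_time(5000, 500)
left = remaining_deadline(30, 12)
print(migration_feasible(exe, 5.0, left))
```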

After checking the above condition, we evaluate migrating 
the virtual machine to each host in order to find the one with 
the lowest energy consumption during migration. The energy 
consumption during migration is obtained from (9) [14]: 

$$\mathrm{PowerDiff} = \mathrm{PowerAfterAllocation} - \mathrm{HostPower}$$

$$\mathrm{PowerAfterAllocation} = \frac{\mathrm{HostUtilizationMIPS} + \mathrm{VmRequestedMIPS}}{\mathrm{HostTotalMIPS}} \qquad (9)$$
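
A direct transcription of (9) is sketched below. As written, (9) yields a utilization-like fraction, so this example assumes that `host_power` is expressed on the same scale; that assumption and the function names are introduced here for illustration.

```python
def power_after_allocation(host_utilization_mips, vm_requested_mips,
                           host_total_mips):
    """Second line of (9): host load fraction after accepting the VM."""
    return (host_utilization_mips + vm_requested_mips) / host_total_mips


def power_diff(host_utilization_mips, vm_requested_mips,
               host_total_mips, host_power):
    """First line of (9): increase caused by allocating the VM.

    host_power is assumed to be on the same scale as the fraction above.
    """
    after = power_after_allocation(host_utilization_mips,
                                   vm_requested_mips, host_total_mips)
    return after - host_power


# Hypothetical host: 1000 of 3000 MIPS in use, VM requests 500 MIPS.
print(power_after_allocation(1000, 500, 3000))
print(power_diff(1000, 500, 3000, 0.25))
```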



C. Define migration time sequence for selected virtual 
machine 

At this level, the migration sequence of the selected virtual 
machines is defined. After the first level finishes, the virtual 
machine queue is created and sorted by increasing deadline. 
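
Ordering the queue by increasing deadline can be sketched with a stable sort. The dictionary layout and the function name are hypothetical.

```python
def migration_sequence(selected_vms):
    """Order the selected VMs by increasing deadline (earliest first).

    selected_vms -- list of dicts with at least a "deadline" key; this
                    layout is an assumption made for the example.
    """
    # sorted() is stable, so VMs with equal deadlines keep queue order.
    return sorted(selected_vms, key=lambda vm: vm["deadline"])


queue = [
    {"id": "vm1", "deadline": 30},
    {"id": "vm2", "deadline": 10},
    {"id": "vm3", "deadline": 20},
]
print([vm["id"] for vm in migration_sequence(queue)])
```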

Figure 3. An example of defining the sequence for VM migration 

D. Select suitable destination for migrated virtual machine 

At this level, every host is checked. For each host that, 
after accepting the migrated virtual machine, would not fall 
into the high-consumption hosts group under the defined 
scenario, the following calculations take place: 

$$\mathrm{VmMigrationTime} = \frac{\mathrm{VmRam}\ (\mathrm{MB})}{\mathrm{HostBW}\ (\mathrm{Gb/s}) \,/\, 2} \qquad (6)$$

The main part of a virtual machine to be migrated is its 
memory, whose size affects the migration time. The migration 
time therefore depends on the amount of memory 
(information) that the host can send per second. The 
bandwidth is divided by 2 because half of the bandwidth is 
used for migration and the other half for communication with 
the destination host. 
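
Equation (6) can be sketched as below. Because (6) mixes megabytes and gigabits per second, this example converts both to megabits explicitly; that conversion (8 bits per byte, 1000 Mb per Gb) is an assumption of the sketch and is not stated in the text.

```python
def vm_migration_time(vm_ram_mb, host_bw_gbps):
    """Equation (6) with explicit unit conversion (assumed), in seconds.

    vm_ram_mb    -- RAM of the migrated VM in megabytes
    host_bw_gbps -- host bandwidth in gigabits per second
    """
    ram_megabits = vm_ram_mb * 8                     # MB -> Mb
    migration_bw_mbps = host_bw_gbps * 1000 / 2      # half the bandwidth
    return ram_megabits / migration_bw_mbps


# Hypothetical VM with 1024 MB of RAM on a 1 Gb/s host.
print(vm_migration_time(1024, 1))
```

Doubling the RAM doubles the estimate, and doubling the bandwidth halves it, which matches the proportionality expressed by (6).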

The virtual machine migration time is now determined for 
each host in turn in order to find a suitable destination for the 
virtual machine. Since part of the job executes before and 
during migration, the remaining length of the job, used in the 
execution time calculation, is obtained from (7): 



TABLE I. Description of used parameters 

Parameter               Description 
VmWaitingTime           Time that elapses until execution on the VM starts 
JobStartExeTime         Start time of job execution in the VM 
VmCreationTime          Time of VM creation in the host 
VmExecutionTime         Time that elapses until VM execution finishes 
JobLength               Job length of the VM: the number of instructions (in millions) that must execute to complete a job 
VmMIPS                  Number of million instructions that execute in a second 
VmNoPEs                 Number of the VM's processing elements 
Deadline                Deadline of the VM for executing its job 
BestCase                Best-case execution time of the requested job 
WorstCase               Worst-case execution time of the requested job 
RemainLength            Remaining length of the job 
VmRemainExeTime         VM remaining execution time 
RemainDeadline          Remaining time of the deadline 
VmMigrationTime         VM migration time 
VmRam                   RAM of the migrated VM 
HostBW                  Bandwidth of the host 
PowerDiff               Difference between host energy consumption after migration and energy consumption of the host 
PowerAfterAllocation    Host energy consumption after migration (VM allocation) 
HostPower               Host power consumption 
HostUtilizationMIPS     Host utilization before migration, in million instructions per second 
VmRequestedMIPS         Million instructions per second requested by the migrated VM 
HostTotalMIPS           Total million instructions that the host can execute in a second 

Support for virtual machine migration is one reason for using 
this simulator. 

B. Simulation of the proposed method in CloudSim 

In this simulation we use one cloud, and migration takes 
place between physical machines in this cloud. In this scenario 
we use one datacenter with the following features: 

• System architecture : x86 



In general, the proposed algorithm is executed again after a 
specific time slot (5 min) in order to check the condition of the 
hosts. At this level the remaining execution time and the 
remaining deadline are evaluated. The reasons for using the 
proposed method to schedule virtual machine migration are: 

• Load balancing 

• Decreasing the energy consumption caused by virtual 
machine migration and disabling idle servers 

• Suitability for environments that have to respond to 
requests within a specific time, thereby ensuring the 
SLA between the customer and the cloud service provider 

Next, we describe the implementation and evaluate the 
results. 



input: 
    HostList 
    VM that needs migration 
    CurrentTime 
initialize minPower as the maximum value of double 
for each Host do 
    calculate VM remaining execution time 
    calculate VM remaining deadline 
    calculate VM migration time 
    calculate sum of execution time and migration time 
    calculate power after allocation 
    if (sum <= Deadline) 
        find the host to which migration has minimum power consumption 
        add host to allocated host 



Figure 4. Pseudo code of selecting a suitable destination for the migrated 
virtual machine 
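
The pseudo code of Figure 4 can be sketched in Python as follows. This is an illustrative reading, not the authors' CloudSim implementation: the `Host` record, the function name, the unit conversion in the migration time and the use of the power difference of (9) as the quantity to minimize are assumptions of this sketch.

```python
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    utilization_mips: float  # HostUtilizationMIPS
    total_mips: float        # HostTotalMIPS
    bw_gbps: float           # HostBW in Gb/s
    power: float             # HostPower, on the same scale as (9)


def select_destination(hosts: List[Host], remain_length: float,
                       vm_mips: float, vm_ram_mb: float,
                       remain_deadline: float) -> Optional[Host]:
    """Pick the feasible host with the smallest power increase."""
    min_power = sys.float_info.max  # "max value of double"
    allocated = None
    # Remaining execution time; identical for every host in this sketch.
    remain_exe_time = remain_length / vm_mips
    for host in hosts:
        # Migration time as in (6), converting MB and Gb/s to megabits.
        migration_time = (vm_ram_mb * 8) / (host.bw_gbps * 1000 / 2)
        total = remain_exe_time + migration_time
        # Power after allocation and power difference as in (9).
        power_after = (host.utilization_mips + vm_mips) / host.total_mips
        diff = power_after - host.power
        if total <= remain_deadline and diff < min_power:
            min_power = diff
            allocated = host
    return allocated  # None if no host meets the deadline


hosts = [
    Host("h1", 1000, 4000, 1, 0.25),
    Host("h2", 500, 4000, 1, 0.2),
    Host("h3", 0, 4000, 0.001, 0.125),  # too slow a link to meet the deadline
]
best = select_destination(hosts, 5000, 500, 1024, 30)
print(best.name if best else None)
```

Returning `None` when no host satisfies the deadline leaves the decision about what to do with the virtual machine to the caller.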



V. Implementation of proposed method 

To implement the proposed algorithm we use the CloudSim 
simulator, which is written in the Java programming language, 
in the Eclipse environment. 

A. Definition of CloudSim 

There are several simulators for cloud computing, most of 
which are based on Java. The CloudSim simulator supports 
modeling the behavior of system components such as 
datacenters, virtual machines and provisioning policies. 
Components such as the datacenter, resources, broker and 
cloud information service are defined as entities in CloudSim. 



• Operating system : Linux 

• Virtual machine monitor : XEN 

In this datacenter, there are 50 physical machines with the 
following features: 

• Number of physical machine types : 2 types 

• Number of processing elements : 2 

• Amount of RAM : {4096, 8192}, depending on the 
type of physical machine 

• Amount of storage : 1 GB 

• Amount of bandwidth : 1 GB/s 

The type of each physical machine is calculated as follows at 
each stage: 

$$\mathrm{hostType} = i \bmod \mathrm{NoOfHostType}$$

The features of the virtual machines are as follows: 

• Number of virtual machine types : 10 types 

• Number of million instructions per second : 
between 500 and 2500 MIPS, depending on the 
type of virtual machine 

• Number of processing elements : 1 

• Amount of bandwidth : between 100 and 118 
Mb/s 

• Size of virtual machine : 2.5 GB 

The type of each virtual machine is calculated as follows at 
each stage: 



$$\mathrm{vmType} = \frac{\mathrm{vmsNumber}}{\mathrm{NoOfVmType}}$$

Features of the requested jobs: 

Here, we consider 6 different job types, each with a 
different start time and length. The type of each requested job 
is calculated as follows at each stage: 



$$\mathrm{CloudletType} = \frac{\mathrm{cloudletsNumber}}{\mathrm{NoOfCloudletType}} \qquad (12)$$

C. Comparing proposed method with existing methods based on energy consumption 

Many methods have been proposed so far for decreasing 
energy consumption in cloud computing environments. One of 
the most important is the one proposed by the CloudSim 
development group. Beloglazov et al. [16] in 2012 proposed a 
method for dynamic consolidation of virtual machines based 
on an analysis of virtual machine resource consumption data. 
This method decreases energy consumption while 
guaranteeing the service level agreement. That work combines 
several methods for the virtual machine selection policy and 
the virtual machine allocation policy. The combination of local 
regression and minimum migration time decreases energy 
consumption and gives better results. This selection policy 
uses the Loess method, proposed by Cleveland, to select hosts 
whose consumption is greater than the threshold. 

For virtual machine selection, this method selects the 
virtual machine with the lowest migration time. Migration time 
is estimated as the amount of RAM used by the virtual machine 
divided by the available bandwidth of the host. The results of 
this method are compared with the proposed algorithm on 
energy consumption, number of virtual machine migrations 
and number of host shutdowns. The existing and proposed 
algorithms were executed in the Eclipse environment with the 
CloudSim simulator. The number of physical and virtual 
machines differs in each experiment. 




Figure 5. Comparison of energy consumption 





Figure 7. Comparison of performance degradation due to migration 

In the proposed method, virtual machines whose execution 
time plus waiting time is greater than their defined deadline 
are candidates for migration from the beginning, so the 
number of unnecessary migrations decreases; and because the 
migration process imposes overhead on the system, energy 
consumption decreases in the proposed method. The decrease 
in energy consumption together with the decrease in the 
number of host shutdowns shows that migration is efficient 
and performs well. As seen in the third diagram (Fig. 7), the 
performance degradation due to virtual machine migration to 
another host, and the ratio of unplaced virtual machines to all 
migration requests, are much lower in the proposed method. 
One reason is the reduced number of host shutdowns, because 
this operation causes performance degradation and system 
overhead. 

D. Comparison of proposed and existing methods based on deadline 

Many theses and studies use deadlines in scheduling, but 
most of them do not use migration. One of them, proposed in 
2014 by Hadadian et al. [17], uses scheduling based on a 
predefined deadline and a balance factor in a cloud computing 
environment, and its results are compared with the FIFO and 
Round Robin algorithms. The goal of this method is to achieve 
minimum cost and reduce the number of unsuccessful jobs, 
and therefore obtain better performance. In this method 
received jobs are sorted by deadline, the best cloud is chosen 
for job execution, and the algorithm is executed again at 
specific intervals in order to make changes if necessary. This 
method increases the success ratio. Finally, the cloud that has 
the maximum balance factor and whose response time meets 
the deadline is selected as the suitable cloud. This method does 
not consider virtual machine migration, which is proposed as 
future work to improve load balancing. 

This simulation uses 6 different data sets for execution time 
and deadline. Each contains 10 jobs (tasks), each with a 
specific execution time and deadline. We compare our 
proposed method with this method on the minimum, 
maximum and average rate of request loss. 



Figure 6. Comparison of number of migration 

Figure 8. Comparison of loss rate 

This comparison shows that the average and maximum loss 
rates are improved in the proposed method, i.e. the proposed 
method can complete more requests successfully. The 
proposed method also considers energy consumption and uses 
migration when the response time on a host becomes greater 
than the deadline, as well as for reducing energy consumption. 
Therefore, if a host cannot respond to a received request, the 
virtual machine migrates to another host that is able to 
respond on time, and because of the migration the request loss 
rate is reduced. 

VI. Conclusion 

As mentioned earlier, the internet and the IT world, which 
are a vital part of human life, are growing day by day. The 
needs of community members, such as information security, 
fast processing, online and dynamic access, the ability to focus 
on an organization's projects instead of maintaining servers, 
and cost savings, are very important too. Today the 
technological solution is cloud computing. Cloud computing 
delivers services instead of products and uses virtualization 
technology over the internet. 

In this thesis we respond to requests within their defined 
deadlines and consider energy consumption as an important 
parameter. If a host cannot respond, or its energy consumption 
becomes greater than the threshold, the virtual machine 
migrates to another host that is able to serve it after migration. 
Energy consumption must be lower than the threshold after 
migration. The algorithm is repeated at specific intervals, so at 
every stage of the simulation the status of the virtual machines 
can be checked and migration happens if necessary. Although 
this method increases the computation overhead, the 
evaluation and comparison of results show that it is much 
better in energy consumption than existing methods, and by 
reducing unnecessary migrations it reduces migration 
overhead. The performance degradation and the cost of 
migration are therefore lower than in existing methods. The 
proposed method responds better within the defined deadline 
than the method presented in [16], and the average and 
maximum request loss rates are lower than in the existing 
method. 

Abstract — The present work introduces a new method for 
discriminating electroencephalogram (EEG) power spectra of 
the brains of rats housed in different architectural shapes. The 
discrimination ability of neural networks is used to describe 
the effect of different environments on brain activity. In this 
research the rats were divided into four groups according to 
the type of environmental shape: control (normal cage), 
pyramidal, inverted pyramidal and circular. Brain activity 
(EEG) was recorded from the rats of each group. Fast Fourier 
Transform (FFT) analysis of the EEG signals was carried out to 
obtain power spectra. Two different neural networks are used 
as classifiers for the power spectra of the four groups: a 
multi-layer perceptron (MLP) with backpropagation and a 
radial basis function (RBF) network with the unsupervised 
K-means clustering algorithm. Experimental studies have 
shown that the proposed algorithms give good results when 
applied and tested on the four groups. The multi-layer 
perceptron with backpropagation and the radial basis function 
network achieved performance rates of 94.4% and 96.67%, 
respectively. 

Keywords: EEG, Architectural shape, artificial neural networks, 
power spectrum, Backpropagation Algorithm, Radial basis 
function network. 

I. Introduction 

Electroencephalography (EEG) is the recording of
electrical potential differences during brain activity. EEG
signals are often used for the observation and diagnosis of
neurological disorders [1, 2]. EEG provides information about
the functional state of the brain rather than about its
structure.

Environment is the combination of external physical
conditions that surround and influence the growth,
development and survival of organisms. Because the
architectural environment surrounds the human being, many
trends have appeared in architecture that aim to achieve
environmental balance and human comfort [3].



Since energy is the ability to produce an effect, the
present scientific view of reality supports the idea that we are
composed of energy fields and that the wave is the principal form
of energy. All levels of energy react with each other
through resonance to produce some kind of harmony or
energy balance. This affects the energy balance of humans and
other living organisms in the form of waves or radiation energy
with a specific wavelength and frequency. The human body itself
produces electromagnetic fields that enable it to react with
other objects in the surrounding environment that have
their own frequency or vibration [4]. So we could say that two
similar geometrical shapes have similar frequencies, vibrations
and motion energy that transfer to the human by resonance.
The architectural design of a room may create a new medium
that influences the physical and physiological
state of the bodies within it [5].

Research on pyramids provides some evidence that the space
within the Great Pyramid and its small replicas intensifies
and/or generates the energy of electromagnetic radiation and
other forms of the so-called universal energy [6]. The effect of
this 'pyramidal energy' has been studied on solids [7], liquids
[8], plants [9], microorganisms [10], animals [11] and even
human volunteers. Some of the findings of such
studies include rapid growth of plants, faster healing of bruises
and burns, longer preservation of milk [12], and enhanced
vitalization and better relaxation in human subjects [13]. A
number of volunteers have reported that meditating
inside the pyramid was easier than meditating outside, as they
felt more peaceful, more relaxed and less distracted [14].

Some previous studies were carried out to show the effect
of architectural shapes on rats. Exposure to a pyramidal
environment reduced neuroendocrine and oxidative stress and
increased antioxidant defense in rats [15], and provided better
wound healing and protection against stress-induced
neurodegenerative changes [16]. EEG recorded from rats
housed in different shapes showed an enhancement in alpha
brain waves for rats housed in the pyramidal shape compared with
other shapes [17]. In this research we use artificial neural
networks to discriminate between the EEG of the brains of rats
housed in different architectural shapes.

Artificial Neural Networks (ANNs) are designed to 
perform information processing like a human brain. ANN is a 
mathematical model which performs calculating, decision 
making and learning. It is based on a human nervous system; 
therefore, ANN components simulate neural networks. In an 
ANN, processing element, weight, activation function and 
output correspond respectively to cell body, synapse, dendrite 
and axon in a biological neural network of the brain [18]. 

Artificial neural networks (ANNs) are widely used in
science and technology, with applications in various branches
of chemistry, physics, and biology [19]. ANNs have proven
useful in the analysis of blood and urine samples of diabetic
patients [20], leukemia classification [21] and EEG recordings
for the prediction of epileptic seizures [22]. In addition, ANNs
have also been applied in the diagnosis of colorectal cancer [23],
colon cancer [24] and early diabetes [25].

In the present research, we use two different architectures
of artificial neural networks, namely a multi-layer
perceptron (MLP) with backpropagation and a radial basis
function (RBF) network, to discriminate EEG power spectra
from the four different environmental shapes.

This paper is organized as follows: in section II, the materials
and methods are described. The two types of neural networks
used as classifiers are described in section III. In section IV, the
results and discussion of the study are provided. Finally,
conclusions are given in section V.

II. Materials and Methods

Forty adult male albino rats weighing more than 150 g were used
in the present study. The rats were kept in an animal house under
constant laboratory conditions and were provided with food and
water for 3 weeks. The biological clock of the animals was kept
normal.

A. Architectural Shapes of Rats Housing 

As shown in Figure 1, the animals were divided into 4
groups (10 in each):

• The first group was considered as a control in a wire cage.

• The pyramidal shape has dimensions of 21.5 cm height, 50 cm
base and 22 cm for each side. The four triangular sides of the
pyramidal shape are angled upwards at an angle of 51° with the
base.

• The inverted pyramidal shape has dimensions of 25 cm
height, 83 cm base and 50 cm side. The four triangular
sides of the inverted pyramidal shape are angled upwards
at an angle of 129° with the base.

• The circular shape has dimensions of 28 cm height and 55 cm
diameter.

The latter 3 geometrical shapes are made of wood with
a wire mesh door at the top [26].




Figure 1. The classification of the different experimental groups.

B. EEG Recording Method

Rats were anesthetized with sodium pentobarbital (50
mg/kg, intraperitoneal) prior to electrode implantation. Four
permanent epidural stainless steel electrodes were implanted in
the skull in positions overlying both the motor and visual cortices
of the rat's brain. A fifth electrode (the reference electrode) was
implanted in the contralateral crest of the skull (nasal) at the
interaural line [27]. The five implanted electrodes were
connected to the amplifier and then to the analogue-to-digital
converter (A/D).

As shown in Fig. 2, the EEG signal from the rat passes
through the electrodes to the jack box, amplifiers, filters,
analogue-to-digital converter and finally to the computer display.
The recording was repeated several times in order to ensure that
the EEG recordings were stable and the rat's brain was free from
any lesions. After recording the EEG signals from the rat's brain,
FFT analysis was carried out on the EEG signals to obtain power
spectra.

The spectra resulting from this analysis are divided into
four frequency bands: Delta (δ) (1-4 Hz), Theta (θ) (4-8 Hz),
Alpha (α) (8-13 Hz) and Beta (β) (14-30 Hz) [28, 29]. The
mean power of amplitude (µV²) of the EEG
signals was calculated for each band [29]. The recorded EEG
signals were grouped into four main categories based upon the
environmental shape: (1) pyramidal, (2) inverted pyramidal,
(3) circular and (4) control.
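
The band-power step described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the synthetic test signal, the 64 Hz sampling rate, the power normalization and the treatment of band edges (lower edge inclusive, upper edge exclusive) are all assumptions. A naive discrete Fourier transform stands in for the FFT so that the example needs no third-party library.

```python
import cmath
import math

# Frequency bands in Hz as listed in the text (delta, theta, alpha, beta).
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (14, 30)}


def power_spectrum(signal, fs):
    """Return (frequencies, power) for the non-negative frequencies of a real signal.

    A naive O(n^2) DFT is used here; an FFT routine gives the same values faster.
    """
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n ** 2)
    return freqs, power


def mean_band_power(signal, fs):
    """Mean spectral power inside each band (assumed edges: lower inclusive, upper exclusive)."""
    freqs, power = power_spectrum(signal, fs)
    result = {}
    for name, (lo, hi) in BANDS.items():
        values = [p for f, p in zip(freqs, power) if lo <= f < hi]
        result[name] = sum(values) / len(values) if values else 0.0
    return result


# Hypothetical input: a 10 Hz sine sampled at 64 Hz for 2 s, so the power
# should concentrate in the alpha band.
fs = 64
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
band_power = mean_band_power(signal, fs)
```

For real recordings the same function would be applied to each EEG channel, giving one mean power value per band and channel.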




Figure 2. Block diagram of the main functional components of the EEG
recording system (the subject is placed in a Faraday cage).



C. Input data preparation 

Each rat had 4 electrodes connected to it. These provided 4
monopolar EEG channels as well as 4 bipolar EEG channels.
The data from each of the 8 channels were divided into four
different frequency bands: delta, theta, alpha and beta. This
gives 32 band values from each weekly recording. The data from 3
weeks are presented as a vector of 96 data points. Forty samples
from the four groups, each of length 96 data points, are presented
to the neural network for classification.
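
As a quick check of the arithmetic above (8 channels × 4 bands × 3 weeks = 96), the sketch below flattens the weekly band powers into one input vector. The nesting of the input data and the ordering of the values inside the vector (week, then channel, then band) are assumptions made for illustration; the text does not specify them.

```python
CHANNELS = 8                                  # 4 monopolar + 4 bipolar channels
BANDS = ("delta", "theta", "alpha", "beta")   # four frequency bands
WEEKS = 3                                     # three weekly recordings


def build_feature_vector(weekly_band_power):
    """Flatten weekly_band_power[week][channel][band] into one list of 96 values."""
    if len(weekly_band_power) != WEEKS:
        raise ValueError("expected one recording per week")
    features = []
    for week in weekly_band_power:
        if len(week) != CHANNELS:
            raise ValueError("expected 8 channels per recording")
        for channel in week:
            features.extend(channel[band] for band in BANDS)
    return features


# Hypothetical placeholder values, chosen so each position is easy to recognize.
weekly = [
    [{band: week * 100 + ch * 10 + i for i, band in enumerate(BANDS)}
     for ch in range(CHANNELS)]
    for week in range(WEEKS)
]
features = build_feature_vector(weekly)
```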

III. Artificial Neural Networks

Artificial Neural Networks (ANNs) are interconnections
of simple processing nodes whose functionality is modeled
on the neurons in the brain. The ANN consists of an input
layer, an output layer and at least one hidden layer. The input
values to the network are fed from the input layer through
the hidden layer to the output layer. The input values
are forwarded to the nodes in the hidden layer. The values
obtained as inputs are processed within the hidden layer and
forwarded either to the nodes of the next hidden layer or to the
nodes of the output layer.

The processing capacity of the network is determined by 
the relative weights of the connections in the network. 
Supervised learning is the process of changing the weights of 
the links between the nodes of the network to map patterns 
presented at the input layer to target values of the output layer. 
This is done using training algorithms. 

In the following subsections two different types of neural
networks are presented to be used for the classification of EEG
signals from different architectural environments [30].

A. MLP with BP Algorithm Network

In this subsection the structure of the MLP neural network 
with the backpropagation (BP) algorithm used to adjust its 
parameters for classification purposes is introduced. 

1 ) Multilayer Perceptron 

The multilayer neural network architecture can have any
number of layers. Fig. 3 illustrates a network architecture with 4
layers. The first layer is the input layer and the last layer is the
output layer. Between the first and last layers are the second and
third layers, which are the hidden layers. In this work two
hidden layers are used to classify the EEG of the 4 different
shapes, each represented by a feature vector.




Figure 3. Multilayer perceptron architecture.

2 ) Backpropagation Algorithm 

The backpropagation algorithm is the most popular of the
supervised learning algorithms. It is considered a
generalization of the delta rule for nonlinear activation
functions.

Learning of the MLP by Backpropagation algorithm has 
two phases. First, a training input pattern is presented to the 
input layer. The network propagates the input pattern from 
layer to layer until output pattern is generated by the output 
layer. If this pattern is different from the desired output, an 
error is calculated and then propagated backward through the 
network from the output layer to the input layer. The weights 
are modified as the error is propagated. The backpropagation 
training algorithm is an iterative gradient descent approach 
designed to minimize the mean square error between the actual 
output of multilayer feedforward perceptron and the desired 
output. It requires continuously differentiable non-linearity. We 
assume a sigmoid logistic nonlinearity as a transfer function of 
the network nodes. 
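
The two phases described above (forward propagation of the pattern, then backward propagation of the error with weight updates) can be sketched as follows. This is a minimal illustration and not the authors' C++ program: the layer sizes, the weight range, the learning rate, the training pattern and all function names are assumptions. It uses sigmoid hidden units and, as one possible choice, linear output units.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(n_in, n_out, rng):
    """Small random weights; the last entry of each row is the bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_net(weights, inputs):
    """Weighted sum plus bias for every neuron of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + row[-1] for row in weights]


def forward(net, x):
    """Forward phase: return the activations of every layer, input first."""
    h1 = [sigmoid(v) for v in layer_net(net[0], x)]
    h2 = [sigmoid(v) for v in layer_net(net[1], h1)]
    y = layer_net(net[2], h2)  # linear output layer (assumption)
    return [x, h1, h2, y]


def train_step(net, x, d, eta):
    """One forward and backward pass; returns the squared error before the update."""
    acts = forward(net, x)
    # Output error terms for linear units: desired minus actual output.
    deltas = [dj - yj for dj, yj in zip(d, acts[3])]
    error = sum(e * e for e in deltas)
    # Walk back from the output layer to the first hidden layer.
    for layer in (2, 1, 0):
        inputs, weights = acts[layer], net[layer]
        below = None
        if layer > 0:
            # Error terms of the layer below, using the weights before they change.
            below = [
                inputs[i] * (1.0 - inputs[i])
                * sum(deltas[j] * weights[j][i] for j in range(len(weights)))
                for i in range(len(inputs))
            ]
        for j, row in enumerate(weights):
            for i, xi in enumerate(inputs):
                row[i] += eta * deltas[j] * xi
            row[-1] += eta * deltas[j]  # the bias sees a constant input of 1
        if below is not None:
            deltas = below
    return error


# Hypothetical toy problem: 3 inputs, hidden layers of 4 and 3 nodes, 2 outputs.
rng = random.Random(0)
net = [init_layer(3, 4, rng), init_layer(4, 3, rng), init_layer(3, 2, rng)]
x, d = [0.2, 0.7, 0.1], [1.0, 0.0]
errors = [train_step(net, x, d, eta=0.1) for _ in range(200)]
```

Repeating `train_step` on the same pattern makes the squared error shrink, which is the behaviour the iterative gradient descent is designed to produce.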

Following is the summary of the BP algorithm:

Step 1: Initialize weights and offsets.

Set all weights and node offsets to small random values.

Step 2: Present input and desired outputs.

Present a continuous-valued input vector $x_0, x_1, \ldots, x_{N-1}$
and specify the desired outputs $d_0, d_1, \ldots, d_{M-1}$. If the
network is used as a classifier, then all desired outputs are
typically set to zero except for the one corresponding to the class
the input is from. That desired output is +1. The input could be
new on each trial, or samples from a training set could be
presented cyclically until stabilization.

Step 3: Calculate actual output.

First calculate the formulas

$$I_j = f^{h1}\Big(\sum_i w^{h1}_{ij}\, x_i + b^{h1}_j\Big) \tag{1}$$

$$Z_r = f^{h2}\Big(\sum_j w^{h2}_{jr}\, I_j + b^{h2}_r\Big) \tag{2}$$

where $x_i$ is the i-th network input, $w^{h1}_{ij}$ is the connection
weight from the i-th input to the j-th neuron in the 1st hidden
layer, $w^{h2}_{jr}$ is the connection weight from the j-th neuron in
the 1st hidden layer to the r-th neuron in the 2nd hidden layer,
$b^{h1}_j$ is the weight from the bias to the j-th neuron in the 1st
hidden layer, $b^{h2}_r$ is the weight from the bias to the r-th neuron
in the second hidden layer, and $f^{h1}(\cdot)$ and $f^{h2}(\cdot)$ are
nonlinear sigmoid activation functions defined as

$$f(\text{netinput}) = \frac{1}{1 + e^{-\text{netinput}}} \tag{3}$$

The network output is calculated by the following equation

$$y_k = f^{o}\Big(\sum_r w^{o}_{kr}\, Z_r + b^{o}_k\Big) \tag{4}$$

where $w^{o}_{kr}$ is the weight connecting the k-th neuron in
the output layer to the r-th neuron in the 2nd hidden layer, $b^{o}_k$
is the bias weight for the k-th output neuron, and $f^{o}$ is the
transformation function between the 2nd hidden layer and the output
layer, which is a linear function.

Step 4: Compute the error signal.

$$e_j(k) = d_j(k) - y_j(k) \tag{5}$$

Step 5: Adapt weights.

Use a recursive algorithm starting at the output nodes and
working back to the first hidden layer. Adjust weights by

$$w_{ij}(k+1) = w_{ij}(k) + \eta\, \delta_j\, x_i \tag{6}$$

In this equation $w_{ij}(k)$ is the weight from hidden node i
or from an input to node j at time k, $x_i$ is either the output
of node i or an input, $\eta$ is the learning rate, and $\delta_j$ is an
error term for node j. If node j is an output node, then

$$\delta_j = d_j - y_j \tag{7}$$

where $d_j$ is the desired output of node j and $y_j$ is the actual
output.

If node j is an internal hidden node, then

$$\delta_j = x_j (1 - x_j) \sum_k \delta_k\, w_{jk} \tag{8}$$

where k is over all nodes in the layers above node j.
Internal node thresholds are adapted in a similar manner by
assuming they are connection weights on links from auxiliary
constant-valued inputs equal to 1.

Step 6: Repeat by going to step 2.

B. Radial Basis Function Neural Network

In this subsection the structure of the RBF neural network
and the algorithms used to adjust its parameters for
classification purposes are introduced.

1 ) Structure of RBF Neural Network

Fig. 4 shows a typical RBF network with q inputs
$(x_1, \ldots, x_q)$ and p outputs $(y_1, \ldots, y_p)$. The hidden layer
consists of h computing units connected to the output by h
weight vectors $(\alpha_1, \ldots, \alpha_h)$.




Figure 4. RBF network architecture (input layer, hidden layer and
output layer).



The response of one hidden unit to the network input at the i-th
instant, $x_i$, can be expressed by

$$\phi_k(x_i) = \exp\left(-\frac{\lVert x_i - \mu^i_k \rVert^2}{(\sigma^i_k)^2}\right), \quad (k: 1 \ldots h) \tag{9}$$

where $\mu^i_k$ is the center vector for the k-th hidden unit at the
i-th instant, $\sigma^i_k$ is the width of the Gaussian function at that
time, and $\lVert \cdot \rVert$ denotes the Euclidean norm. The overall
network response is given by

$$y(x_i) = \alpha^i_0 + \sum_{k=1}^{h} \alpha^i_k\, \phi_k(x_i) \tag{10}$$

where $\alpha^i_k$ is the connecting weight vector of the k-th hidden
unit to the output layer, which is in the vector form
$\alpha^i_k = [\alpha^i_{1k}, \ldots, \alpha^i_{jk}, \ldots, \alpha^i_{pk}]^T$.
Thus the coefficient matrix of the network can be expressed as

$$[\alpha^i_1, \ldots, \alpha^i_k, \ldots, \alpha^i_h] =
\begin{bmatrix}
\alpha^i_{11} & \cdots & \alpha^i_{1k} & \cdots & \alpha^i_{1h} \\
\vdots & & \vdots & & \vdots \\
\alpha^i_{j1} & \cdots & \alpha^i_{jk} & \cdots & \alpha^i_{jh} \\
\vdots & & \vdots & & \vdots \\
\alpha^i_{p1} & \cdots & \alpha^i_{pk} & \cdots & \alpha^i_{ph}
\end{bmatrix}$$

and the bias vector is
$\alpha^i_0 = [\alpha^i_{10}, \ldots, \alpha^i_{p0}]^T$.

2 ) Training Radial Basis Function Network

Training an RBF neural network consists of determining the
location of the centers and widths for the hidden layer and the
weights of the output layer. It is trained using a two-phase
approach: in the first phase, unsupervised learning occurs,
whose main objective is to optimize the location of the centers
and widths. In the second phase, the output layer is trained in a
supervised mode using the least mean-square (LMS) algorithm
to adjust the weights so as to obtain the minimum mean square
error at the output. The following are the three steps of the
hybrid learning method for an RBF neural network, and they
are discussed in more detail in the next three subsections:

i. Find the cluster centers of the radial basis function;
use the k-means clustering algorithm (calculation of
centers).

ii. Find the width of the radial basis function using P-nearest
neighbor.

iii. Find the weights; use LMS (weight estimation).

i. Calculation of Centers

To calculate the centers of the radial basis function we use
the k-means clustering algorithm. The purpose of applying
the k-means clustering algorithm is to find a set of cluster
centers and partition the training data into subclasses. The
center of each cluster is initialized to a randomly chosen input
datum. Then each training datum is assigned to the cluster
that is nearest to it. After the training data have been assigned
to a new cluster unit, the new center of a cluster represents
the average of the training data associated with that cluster
unit.

When all the new centers have been calculated, the
process is repeated until it converges. The recursive k-means
algorithm is given as follows:

1) Choose a set of centers $\{\mu_1, \ldots, \mu_h\}$ arbitrarily and
give the initial learning rate $\gamma(0) = 1$.

2) Compute the minimum Euclidean distance

$$L_i(k) = \lVert x(k) - \mu_i(k-1) \rVert, \qquad r = \arg\min_{i:\, 1 \ldots h} L_i(k) \tag{11}$$

3) Adjust the position of these centers as follows:

$$\mu_i(k) = \begin{cases} \mu_i(k-1) + \gamma(k)\,\big(x(k) - \mu_i(k-1)\big) & (i = r) \\ \mu_i(k-1) & (i \neq r) \end{cases} \tag{12}$$

4) $k = k + 1$, $\gamma(k) = 0.998\, \gamma(k-1)$ and go to 2.

ii. Width Calculation

After the RBF centers have been found, the width is
calculated. The width represents a measure of the spread of the
data associated with each node. Calculation of the width is
usually done using the P-nearest neighbor algorithm. A number
P is chosen and for each center, the P nearest centers are found.
The root-mean-squared distance between the current cluster center
and its P nearest neighbors is calculated, and this is the value
chosen for $\sigma$. So, if the current cluster center is $\mu_j$, the
value of the width is given by:

$$\sigma_j = \sqrt{\frac{1}{P} \sum_{i=1}^{P} \lVert \mu_j - \mu_i \rVert^2} \tag{13}$$

where the sum runs over the P centers $\mu_i$ nearest to $\mu_j$.
A typical value of P is 2, in which case $\sigma$ is set to be the
average distance from the two nearest neighboring cluster
centers.

iii. Weight Estimation

Learning in the output layer is performed after calculation
of the centers and widths of the RBF in the hidden layer. The
objective is to minimize the error between the observed output
and the desired one. It is commonly trained using the LMS
algorithm [32], which is summarized as follows:

Training sample: Input signal vector = $\Phi(k)$

Desired response = $d(k)$

User-selected parameter: $0 < \eta < 1$

Initialization: Initialize the weights $w(0)$.

Computation: For $k = 1, 2, \ldots$ compute

$$e(k) = d(k) - w^T(k)\, \Phi(k)$$

$$w(k+1) = w(k) + \eta\, \Phi(k)\, e(k)$$
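
The three-step hybrid training described above (recursive k-means for the centers, P-nearest neighbor for the widths, LMS for the output weights) can be sketched as follows. This is an illustrative sketch rather than the authors' C++ program: the toy two-cluster data, the number of sweeps, the LMS learning rate and epoch count, the exact form of the Gaussian and all function names are assumptions.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def recursive_kmeans(data, h, rng, sweeps=50, gamma=1.0, decay=0.998):
    """Move only the nearest (winning) center toward each presented sample."""
    centers = [list(c) for c in rng.sample(data, h)]
    for _ in range(sweeps):
        for x in data:
            r = min(range(h), key=lambda i: dist(x, centers[i]))
            centers[r] = [c + gamma * (xi - c) for c, xi in zip(centers[r], x)]
            gamma *= decay  # learning rate shrinks after every sample
    return centers


def widths(centers, p=2):
    """RMS distance from each center to its p nearest neighbouring centers."""
    sigmas = []
    for j, cj in enumerate(centers):
        nearest = sorted(dist(cj, ci) for i, ci in enumerate(centers) if i != j)[:p]
        rms = math.sqrt(sum(v * v for v in nearest) / len(nearest))
        sigmas.append(max(rms, 1e-9))  # guard against coincident centers
    return sigmas


def hidden_response(x, centers, sigmas):
    """Gaussian responses of the hidden units, plus a trailing 1 for the bias."""
    phi = [math.exp(-dist(x, c) ** 2 / s ** 2) for c, s in zip(centers, sigmas)]
    return phi + [1.0]


def lms(phis, targets, eta=0.1, epochs=200):
    """LMS rule per output: e = d - w^T phi, then w <- w + eta * phi * e."""
    n_out = len(targets[0])
    w = [[0.0] * len(phis[0]) for _ in range(n_out)]
    for _ in range(epochs):
        for phi, d in zip(phis, targets):
            for k in range(n_out):
                e = d[k] - sum(wi * pi for wi, pi in zip(w[k], phi))
                w[k] = [wi + eta * pi * e for wi, pi in zip(w[k], phi)]
    return w


def predict(x, centers, sigmas, w):
    """Index of the output node with the largest response."""
    phi = hidden_response(x, centers, sigmas)
    scores = [sum(wi * pi for wi, pi in zip(row, phi)) for row in w]
    return scores.index(max(scores))


# Hypothetical toy data: two well-separated 2-D clusters with one-hot targets.
rng = random.Random(1)
data = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
targets = [[1, 0]] * 3 + [[0, 1]] * 3
centers = recursive_kmeans(data, h=2, rng=rng)
sigmas = widths(centers, p=1)  # only one neighbour exists when h = 2
phis = [hidden_response(x, centers, sigmas) for x in data]
w = lms(phis, targets)
predictions = [predict(x, centers, sigmas, w) for x in data]
```

Only the output weights are trained with the target labels; the centers and widths are found from the inputs alone, which is what makes the scheme a hybrid of unsupervised and supervised learning.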

C. Training and Testing of the Two Different Neural Networks

In this part, the multilayer perceptron with backpropagation
and the radial basis function network with the k-means clustering
algorithm are used to solve the classification problem. The two
NNs are programmed in the C++ programming language [33]. The input
layer of both neural networks consists of 96 source nodes, as
mentioned before. The initial weights of the two
neural networks are taken randomly from the interval [0, 1].
The value of the learning rate for the two neural networks was
chosen experimentally in the range (0.001-0.9), as will be
seen in the next section (IV) [34-36]. The choice of the number of
hidden layer nodes for the MLP and RBF networks is also shown in
section IV. The number of output layer nodes depends on the number
of shapes to be discriminated. In our investigation, the
EEG power spectra of four different shapes are used as
samples. The corresponding desired outputs in both MLP and
RBF are (0,1,0,0), (1,0,0,0), (0,0,1,0) and (0,0,0,1) for the control,
pyramidal, circular and inverted pyramidal groups respectively.
The total number of samples is 220 EEG power spectra for all
shapes. Forty samples have been used as training data for the
neural networks, while 180 samples have been used for testing
the neural networks.



IV. Results and Discussion

Before applying the two ANN models to the
discrimination of the EEG power spectra of the
different shapes, each network needs to be trained to optimize
its performance. For the MLP with BP, different learning
rates are taken in the range (0.001-0.9). Figure 5 shows the
effect of different learning rates on the performance of the MLP
with backpropagation. A learning rate of 0.01 gives the best
performance. The effect of the number of hidden
neurons for the MLP is presented in Table I. With a fixed number
of 100000 iterations, 60 nodes in the first hidden layer and 30
nodes in the second hidden layer resulted in the best performance
compared with the other combinations of hidden nodes.




Figure 5. Influence of different learning rates on the MLP neural network



For the RBF network with the k-means clustering algorithm, Figure
6 shows the effect of different learning rates on the
performance of the network. The best value of the learning rate
was 0.08. Table II shows the effect of the number of
RBF clusters with a fixed iteration number of 100000. The best
performance was obtained when the number of clusters is 4. So
an RBF network with 4 hidden layer nodes and a learning rate
equal to 0.08 is used for testing.
TABLE I. Effect of the Number of Hidden Nodes on Performance
of ANN (Testing Results)

Nodes in 1st   Nodes in 2nd   Accuracy, training (%)
hidden layer   hidden layer   Control   Pyramidal   Inverted pyramidal   Circular
20             10             92.06     88.45       88.37                93.45
30             15             95.43     84.34       90.13                94.44
40             20             89.40     91.16       92.78                94.00
50             25             83.53     85.44       91.31                93.93
60             30             98.13     97.42       99.13                96.08
70             35             93.10     86.51       86.18                93.67



Figure 6. Effect of different learning rates on the RBF neural network
(horizontal axis: value of learning rate, 0 to 0.6).



TABLE II. Effect of the Number of Clusters in Hidden
Layer on Performance of ANN

Clusters in    Accuracy (%)
hidden layer   Control   Pyramidal   Inverted pyramidal   Circular
4              97.06     96.34       95.38                97.13
10             93.79     86.07       88.16                98.30
30             81.85     96.05       42.23                92.25
50             91.87     95.71       83.60                94.39
70             84.05     75.29       89.64                98.26
After training both networks, selected test sets are used (50
from the control group, 50 from the pyramidal group, 40 from the
inverted pyramidal group and 40 from the circular group). It is
found that the RBF neural network performed best, with an average
classification accuracy of 96.67% with respect to the number of
correctly classified instances (Table IV). It was followed by the
MLP neural network with the backpropagation algorithm, with a
classification accuracy of 94.44%, as shown in Table III.



TABLE III. Performance of MLP in Classification of EEG Power
Spectra from the Four Different Shapes

Subjects             No of tested data   No of correct   No of incorrect   Correct %
Control              50                  47              3                 94
Pyramidal            50                  46              4                 92
Circular             40                  39              1                 97.5
Inverted pyramidal   40                  38              2                 95
Total                180                 170             10                94.44



TABLE IV. Performance of RBF in Classification of EEG Power
Spectra from the Four Different Shapes

Subjects             No of tested data   No of correct   No of incorrect   Correct %
Control              50                  49              1                 98
Pyramidal            50                  48              2                 96
Circular             40                  39              1                 97.5
Inverted pyramidal   40                  38              2                 95
Total                180                 174             6                 96.67
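
The percentages in Tables III and IV follow directly from the counts of tested and correctly classified spectra. The short sketch below recomputes them; the dictionary layout and function names are illustrative assumptions, while the counts are copied from the two tables.

```python
# (number tested, number correct) per group, copied from Tables III and IV.
MLP = {"control": (50, 47), "pyramidal": (50, 46),
       "circular": (40, 39), "inverted pyramidal": (40, 38)}
RBF = {"control": (50, 49), "pyramidal": (50, 48),
       "circular": (40, 39), "inverted pyramidal": (40, 38)}


def group_accuracy(table):
    """Percentage of correctly classified spectra for every group."""
    return {name: 100.0 * correct / tested for name, (tested, correct) in table.items()}


def overall_accuracy(table):
    """Percentage of correctly classified spectra over all groups together."""
    tested = sum(t for t, _ in table.values())
    correct = sum(c for _, c in table.values())
    return round(100.0 * correct / tested, 2)


mlp_total = overall_accuracy(MLP)   # 170 correct out of 180
rbf_total = overall_accuracy(RBF)   # 174 correct out of 180
```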



V. CONCLUSION

In the present study we have developed artificial neural
networks for the classification of EEG signals obtained from rats
living in different architectural shapes. Two neural networks
are used for the classification process: an RBF network and an MLP
with backpropagation. An MLP with 60 nodes
in the 1st hidden layer and 30 nodes in the 2nd hidden layer is
used, with a learning rate of η = 0.01. The RBF network has 4 hidden
layer nodes and the learning rate used is η = 0.08. From the test
results using 180 EEG power spectra for the environmental shapes,
we found that the best classification accuracies for the RBF and
MLP NNs are 96.67% and 94.44% respectively.
Abstract — As the adoption of Information and Communication
Technology (ICT) tools in the production and service sectors
increases, the demand for digital data storage with large
storage capacity also increases. Higher reliability and fault
tolerance of storage media systems are among the key requirements
that existing systems sometimes fail to meet, resulting
in data loss. Forward error correction (FEC) is one of the techniques
applied to reduce the impact of data loss in digital data
storage. This paper presents a survey conducted in different
digital data storage companies in Dar es Salaam, Tanzania. Data
were collected and analyzed using the Statistical Package for Social
Sciences (SPSS). Secondary data were captured from user and
manufacturer technical reports. It was revealed that data loss is
still a predominant challenge in the digital data storage industry.
Therefore, the study proposes a new storage media FEC model
using a locked convolutional encoder with the enhanced NTC-Viterbi
decoder.

Index Terms — Storage Media, FEC, NTC, Viterbi, RS

I. INTRODUCTION

A storage medium is any device on which data or information
can be electronically stored, kept, and retrieved when needed
[1, 2]. Storage media are devices that store user information
(data) and applications. The media can be categorized as optical
data storage, magnetic hard disk drives, magnetic tape drives
and flash disk drives [3]. The preservation of records,
information, and data stored electronically, and the ability to
retrieve them later, require more than the safekeeping
of the storage media [4]. There is little published
scholarly work regarding the failure patterns of storage media
and the key factors that affect their lifetime. Most of the
information reported comes from the storage manufacturers
themselves. Since a huge amount of information is stored
and transferred among many storage media, data loss due to
disk failure is a major issue that affects the reliability of the
system. Reliability, fault tolerance and performance of storage
media are the biggest concerns [3]. The demand for storage
media increases every day, and it is estimated that over 90% of
all the information and data produced in the world is
stored on magnetic media, most of it on hard
disk drives [5]. On the other hand, the data storage industry faces
technological challenges due to the increase in demand and
consistency. Digital data have become increasingly important
and are creating a growing demand for data storage systems [6].
The world has entered the information era, and every area of life
is immersed in information.

The demand for digital data storage with large storage
capacity, higher reliability and fault tolerance, easier
accessibility, better scalability and cheap management poses a
remarkable challenge to the storage industry [7]. Essentially,
the demand for data storage keeps growing, and
data storage systems need to have high density, short
access time, and fast input and output transfer rates [6].
The International Data Corporation (IDC) reports that data are
expanding at approximately 50% to 80% per year, and in some
places the growth rate is closer to 100% per year. Some of the
catalysts for this growth include databases, Enterprise
Resource Planning (ERP), Supply Chain Management,
E-procurement, Content Management, Data Mining, Customer
Relationship Management (CRM), Electronic Document
Management (EDM), emails, social media and multimedia [8,
9]. The demand for bigger and faster memories has
led to significant improvements in conventional memory
technologies such as hard disk drives and optical disks. However,
there is strong evidence that these two-dimensional storage
technologies are approaching their fundamental limits [2].
Almost all vital data are now stored on external disk storage
subsystems. The average usable capacity is approximately
2.18 petabytes, growing by up to 12.8% year after year. Factors
such as growth in storage requirements, larger capacity disks
and subsystems, and affordable pricing have led to larger
storage configurations [3]. Microsoft has introduced the
functionality of storage pools and storage spaces in Microsoft
Windows Server 2012 and Microsoft Windows 8. This shows
that improvements in storage media are still needed
[10, 11]. Increasingly, it is not viable to talk about
storage media these days without talking about solid state or
flash storage [12, 13]. The use of portable storage media has
increased; thus, there is a need to improve the security
mechanisms for such devices. A solution that provides
more security and user convenience is needed to avoid direct
data loss and compromising the security of the data [14].

Anyone who has suffered data loss knows
the importance of reliable data storage. Errors on a storage
device may be random or burst errors. Common sources of
error in storage devices include a scratch on the disc,
read/write failures, dust, and the controller [7, 15]. It was
observed that some of the challenges include managing storage
growth, a lack of skilled cloud technology professionals, a lack
of knowledge about forward error correction in storage media,
designing and managing backup, recovery and archive solutions,
and finally convincing management to adopt the cloud. Figure 1
shows the distribution of hardware problems at the Internet
Archive. In Figure 1, storage media problems such as disk failure,
disk error and other disk subsystem problems account for 60% of all
the recorded hardware problems [16]. The most common problem is
disk failure, which accounts for 42% of all problems.



Figure 1: Hardware Failures Distribution at Internet Archive. Source:
[16]. (Pie chart; the legible segments include disk failure (42%),
installation failure (17%), disk error (10%), RAID (7%), power
supply (6%), black screen (3%), boot problem (1%) and network (1%).)

Recent research conducted in some companies in Dar es
Salaam shows the common errors facing storage media. As shown in
Figure 2, the study findings reveal that the most
common error is disk failure.




Figure 2: Common Storage Media Error for Different Companies

Reliability and performance of storage media are the biggest
concerns. This implies that there is a need for research to
improve storage technologies in the existing systems.

II. STATE OF THE ART

In spite of the problems discussed in the previous
section, the digital data storage industry applies different
mechanisms, such as backup and recovery systems, to avoid data
loss during failures. This section discusses the efforts undertaken.



A: Backup System 

Backup is the process of copying files or databases in order to
preserve data in case of equipment failure or any other
calamity. Backup or archived data are later used to recover
systems to their original state after a failure. Backup has two
main goals. The primary goal is to recover data after a loss by
either deletion or corruption, whether intentional or
accidental. The secondary goal is to restore the system to an
earlier condition. There are three common types of backup
systems: full backup, incremental backup, and
differential backup. In a full backup, all the files are backed up
every time the backup system runs. An incremental backup
backs up only the files that have changed since the last backup
of any type. Finally, a differential backup backs up the
files that have changed since the last full backup was performed.
However, different studies indicate that data grow at a very
high rate each year, and even faster for
companies with large data sets, data-intensive applications, or
distributed data centres. All of these data need to be backed
up and duplicated, which makes this solution expensive
and time consuming [17]. Crucial data
should always be backed up to protect them from loss or
corruption. Saving only one backup file might sometimes not
be enough to preserve the information. In the same vein, in order
to safeguard crucial data, one needs to keep at least three
copies of any such data, that is, one primary file and two
backups. The two backup files should be kept on
different types of storage media to protect against different
types of hazards, with at least one file kept offsite [18-20]. The
need to back up sensitive data cannot be
avoided, because the reliability of data storage media can be low.
For this reason, new techniques are introduced
from time to time to secure vital data. For example, Oracle
introduced the Sun ZFS appliance for Oracle Exadata Database,
which can restore up to 7TB/hour [21, 22]. Disaster planning
would not be necessary if we were sure that nothing such as a
hardware problem, a virus, or a user error would ever
happen in our environment. However, if a disaster is anticipated,
backup and recovery systems are important for any vital
data stored in the system. Essentially, the goal is to
ensure that crucial data can always be restored within an
acceptable time frame, within the budget, and without
unnecessary disruption to normal day-to-day activities [23].
However, one has to remember that backup is expensive,
because it needs extra devices and extra time,
especially when the data are large.
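
The three backup types described above differ only in which files they select for copying. The following is a minimal sketch of that selection rule, assuming each file carries a logical modification time; the names `FileState` and `select_files` and the timestamps are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileState:
    path: str
    modified: int  # hypothetical logical timestamp of the last change


def select_files(files, kind, last_full, last_backup):
    """Return the paths that each backup type would copy.

    last_full   -- time of the most recent full backup
    last_backup -- time of the most recent backup of any type
    """
    if kind == "full":
        return [f.path for f in files]
    if kind == "incremental":
        return [f.path for f in files if f.modified > last_backup]
    if kind == "differential":
        return [f.path for f in files if f.modified > last_full]
    raise ValueError(f"unknown backup kind: {kind}")


files = [FileState("a.db", 5), FileState("b.log", 12), FileState("c.cfg", 18)]

# Assumed history: full backup at t=10, incremental backup at t=15.
for kind in ("full", "incremental", "differential"):
    print(kind, select_files(files, kind, last_full=10, last_backup=15))
```

In this sketch the incremental set is the smallest and the full set the largest, which reflects the storage and time trade-off discussed in the paragraph.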

B: Recovery System

Data recovery is the process of restoring or rescuing
unintentionally deleted, corrupted, or inaccessible data from
storage media. Normally, this action is performed when
physical or logical damage to the storage
devices prevents their file system from being
mounted by the host system [24, 25]. Data recovery has
existed since 1974, when magnetic data storage devices
were commercially introduced into the market. This was
after users discovered the vulnerability of the data stored
on those devices; data recovery was quickly introduced to
overcome this problem and protect the data stored on those
devices [8]. Any event that interrupts or
destroys daily operations or processing for a period of
time is called a disaster. A
disaster should be addressed so that the intended operation is
recovered; this is the role of disaster recovery, which must
ensure the continuation of all processes when a
disaster occurs [26]. Today, terabytes
or petabytes of storage are not enough to hold large
databases, and therefore data recovery
becomes more challenging, especially in a distributed
system [27]. The importance of a recovery system cannot be
ignored, and that is why even applications such as Oracle
include a recovery component; for example, the Oracle
Recovery Manager (RMAN)
utility was designed for online backup and
recovery of Oracle database files [28]. Due to the huge increase in
electronic data, a large volume of storage is
required to store those data. Currently, consumers prefer to
store their data in the cloud. However, if the cloud storage gets
corrupted or damaged, the consumers lose their important
data. This led to the introduction of mechanisms to back
up data so that they can be restored at any time when the
cloud fails. One such technique is plain data backup,
though it has many security and reliability problems. To
overcome the problems of plain data backup and recovery, we can
use a system like Redundant Array of Independent Disks (RAID)
[29]. Data can also be recovered from Windows Search, in
which records are obtained from the search database either via
carving or via the Extensible Storage Engine (ESE) API; this
provides a potential source of evidence about files that cannot be
accessed and the reason why they are not accessible [30]. The cost of
keeping data safe should be taken into consideration because it
has financial implications. A company or an
organization must decide on the best practices that
support the solution without compromising
its finances [8, 9]. Data recovery is a field
in which a single mistake might leave the
data irrecoverable. It is therefore very important to perform only
operations with which you are familiar.
The importance of data
recovery cannot be ignored, and that is why even
Windows systems always include data recovery facilities.
However, these systems are very
expensive, and even some of the recovery software
is very expensive. As long as people still
lose their important data, more work is needed
regardless of the presence of these technologies for recovering
data when something happens to our systems.



C: FEC Associated 

Forward Error Correction (FEC) is a digital signal processing
technique used to improve data reliability [31]. FEC makes
use of error correction codes to detect and correct errors
automatically at the receiver. Error correction coding is the
means whereby errors which may be introduced into digital data
can be corrected based upon the received data. Error detection
coding is the means whereby errors can be detected based
upon the received data. Collectively, error correction and
error detection coding are called error control coding [32, 33].
Forward Error Correcting (FEC) codes provide algorithms for
encoding and decoding data bits, and help to achieve data
rates nearer to Shannon’s limit [34]. Similarly, error coding is
a method of providing reliable digital data transmission and
storage for a given signal to noise ratio [35]. Error coding is
used for fault
tolerant computing in computer memory, magnetic and optical
data storage media, satellite and deep space communications,
network communications, cellular telephone networks, and
almost any other form of digital data communication [36, 37].
The design of efficient error correcting codes needs a
complete understanding of the error mechanisms and error
characteristics [38]. There are several types of codes. The first
major classification is into linear and non linear codes. A linear
code is
an error correction code for which any linear combination of
codewords is also a codeword. Linear codes are partitioned into
block codes and convolutional codes. Linear codes allow for
more efficient encoding and decoding algorithms than other
codes [39].

Figure 3: Error Correction Block

Linear codes are encoded using the methods of linear
algebra and polynomial arithmetic. If C is a linear code of
length n that, as a
vector space over the field F, has dimension k, then we say that
C is an (n,k) linear code over F, or an (n,k) code for short.
Since linear codes allow for more efficient encoding and
decoding algorithms, we focus on linear codes, namely
block codes and convolutional codes. A
convolutional code operates on streams of data bits
continuously, inserting redundant bits used to detect and
correct errors. Convolutional codes are extensively used for
real time error correction [40]. Block codes, on the other hand,
encode data in discrete blocks rather than continuously.
Convolutional codes are processed on a bit by bit basis. They
are particularly suitable for implementation in hardware,
and the Viterbi decoder is one of the best algorithms for
convolutional codes because it allows optimal decoding.
Block codes are processed on a block by
block basis, and the best known of them are the Reed Solomon codes
[41]. Different algorithms are used for error
correction in storage media. Currently, most error
correction in storage media is done by block codes,
because block codes have the power to correct burst
errors. Convolutional codes are more powerful than
block codes, but they have more computational complexity.
Mrutu and his colleagues showed that the Viterbi
algorithm decoder has less computational complexity than
other convolutional decoders [42]. Furthermore, the
introduction of the non transmittable code words technique to
assist Viterbi decoders has enabled them to overcome burst errors
[42]. This fact motivated the researchers to investigate the
effectiveness of the technique in storage media.
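
The defining property of an (n,k) linear code, that any linear combination of codewords is again a codeword, can be checked directly on a small example. The sketch below assumes a systematic (7,4) Hamming code over GF(2) as the illustration; this particular code and the helper names `encode` and `add` are not taken from the paper.

```python
from itertools import product

# Generator matrix of a systematic (7,4) Hamming code over GF(2):
# k = 4 message bits, n = 7 codeword bits (assumed example code).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def encode(message):
    """Multiply a k-bit message by G modulo 2 to obtain an n-bit codeword."""
    n = len(G[0])
    return tuple(sum(m * row[j] for m, row in zip(message, G)) % 2
                 for j in range(n))


def add(u, v):
    """Add two words over GF(2), i.e. bitwise XOR."""
    return tuple((a + b) % 2 for a, b in zip(u, v))


codewords = {encode(m) for m in product((0, 1), repeat=4)}

# Linearity: the sum of any two codewords is again a codeword.
closed = all(add(u, v) in codewords for u in codewords for v in codewords)

# For a linear code, the minimum distance equals the minimum nonzero weight.
min_distance = min(sum(c) for c in codewords if any(c))

print(len(codewords), closed, min_distance)
```

With a minimum distance of 3, this small code can correct any single bit error in a 7-bit block, which is the block-by-block behaviour the paragraph attributes to block codes.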



Reed Solomon

Reed Solomon (RS) is an error correcting code that addresses
multiple error correction, especially of burst errors, in
storage media (hard disk drives, CD, and DVD), wireless and mobile
communication units, satellite links, and digital
communication. Reed Solomon is among the best block
coding techniques, in which the data stream to be transmitted is
broken down into blocks and redundant data is then added to
each block. The size of the block and the amount of check data
added to each block is either specified for a particular
application or can be user defined for a closed system [43, 44].
RS codes are non binary cyclic codes with symbols made up
of m bit sequences, where m is any positive integer
greater than 2. Reed Solomon codes are block codes
represented as RS (n,k), where n is the code
word length in symbols and k is the number of data symbols. The
code word length n is given by 2t + k, where 2t is the number of
parity symbols.

Figure 4: Structure of Reed Solomon

k = number of data symbols
2t = number of parity symbols
n = total number of symbols in an RS code word
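
The relation n = 2t + k fixes the number of parity symbols and the error correcting capability once n and k are chosen. The following is a small sketch of that arithmetic; the function name `rs_parameters` is hypothetical, and RS(255, 223) is used only as a familiar example of an 8 bit symbol code, not as a code taken from the paper.

```python
def rs_parameters(n, k):
    """Return (parity_symbols, t, code_rate) for an RS(n, k) code."""
    if not 0 < k < n:
        raise ValueError("require 0 < k < n")
    parity = n - k      # 2t parity symbols
    t = parity // 2     # symbol errors correctable at unknown positions
    return parity, t, k / n


parity, t, rate = rs_parameters(255, 223)
print(f"2t = {parity}, t = {t}, code rate = {rate:.3f}")
```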

The Reed Solomon (RS) code was invented by Irving S. Reed and
Gustave Solomon, who were then staff members at MIT
Lincoln Laboratory. It was subsequently presented to the
world in a paper in 1960 [45]. The RS code construction is
based on Galois Field GF(2^w) operations, where w is a
positive integer. To encode Reed Solomon, we divide the
polynomial representing the message, multiplied by x^2t, by a
generator polynomial; the remainder gives the RS parity symbols,
which we attach to the original message. The first commercial
application of RS appeared in 1982 in the compact disc (CD).
This was the first storage media device that used these error
correction code mechanisms [46]. RS provides a systematic
way of building codes that can detect and correct
multiple random errors for different applications, and they have
strong burst and erasure error correction capabilities [47]. RS
codes are also able to correct up to t symbol errors, and
they can correct up to 2t erasures, that is, errors
whose positions are known [48].
RS codes are especially suited to applications
where burst errors occur.
This is because it does not matter to the code how many bits in
a symbol are incorrect; if multiple bits in a symbol are
corrupted, it only counts as a single symbol error. Conversely,
for isolated single bit errors, an RS code is usually a poor choice.
RS operates on two sides, the encoder and the decoder.
The code operates on 8 bit symbols (m = 8), and the number of
symbols in the
encoder block is n = 2^m - 1. The generator polynomial generates
redundant symbols, which are appended to the message
symbols. The decoder side identifies the
location and magnitude of each error. The same generator
polynomial is used for the identification, and the correction
is applied to the received code word.
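
The encoding step described above, dividing the shifted message polynomial by the generator polynomial and appending the remainder, can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in any storage device: it assumes GF(2^8) built from the primitive polynomial 0x11D, generator roots alpha^0 to alpha^(2t-1), and a shortened code word; all function names are hypothetical. Only encoding and syndrome computation are shown, not the full decoder.

```python
PRIM = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed primitive polynomial)

# Exponential and logarithm tables for GF(2^8) with alpha = 2.
EXP = [0] * 512
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM
for power in range(255, 512):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply polynomials stored with the highest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def generator_poly(nsym):
    """g(x) = (x - alpha^0)...(x - alpha^(nsym-1)); minus equals plus here."""
    g = [1]
    for i in range(nsym):
        g = poly_mul(g, [1, EXP[i]])
    return g


def rs_encode(msg, nsym):
    """Append nsym parity symbols: remainder of msg(x) * x^nsym / g(x)."""
    gen = generator_poly(nsym)
    rem = list(msg) + [0] * nsym
    for i in range(len(msg)):
        coef = rem[i]
        if coef:
            for j in range(1, len(gen)):
                rem[i + j] ^= gf_mul(gen[j], coef)
    return list(msg) + rem[len(msg):]


def syndromes(codeword, nsym):
    """Evaluate the received polynomial at each generator root (Horner)."""
    result = []
    for i in range(nsym):
        y = 0
        for c in codeword:
            y = gf_mul(y, EXP[i]) ^ c
        result.append(y)
    return result


message = list(b"storage media")
codeword = rs_encode(message, nsym=4)          # 2t = 4, so t = 2
print(syndromes(codeword, 4))                  # all zero for a valid word

damaged = codeword[:]
damaged[3] ^= 0xFF                             # 8 bit burst inside one symbol
print(any(syndromes(damaged, 4)))              # nonzero syndromes flag it
```

The damaged word illustrates the point made in the text: all eight flipped bits fall inside one symbol, so the code sees a single symbol error.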

Fundamentally, RS is suitable for many applications because it
has a very high coding rate and low complexity. Although RS is
suitable for storage media, we still face problems in our
storage media, which alerts researchers to the
need for more work on improving the error correction
mechanisms in storage media. Reed Solomon coding
currently remains the optimal code for the smallest storage
systems [49]. Generally, Reed Solomon is considered not very
scalable [7]. Altogether, Reed-Solomon coding employs the
same methodology. Here the n data words form a column vector,
and m coding words are computed from them. The product is an
(n + m) element column vector representing the data and
coding words [45, 50].

Convolutional Codes 

A convolutional code is an error correcting code that
generates parity symbols through the sliding application of a
Boolean polynomial function G(z) to the data stream. The
sliding application represents the convolution, and
its nature facilitates trellis decoding. Convolutional
codes are often termed continuous. However, they have arbitrary
block length, rather than being truly continuous, since most
encoding is performed on blocks of data. Convolutional codes
were introduced in 1955 by Peter Elias [51-53]. The main
challenge in convolutional codes is to find a method
for constructing codes of a given rate while minimizing their
complexity [54]. Convolutional codes belong to FEC, which
makes use of algorithms to automatically detect and correct
errors. They are among the powerful error correcting codes and
have a lower code rate (k/n). The common modulation schemes
used are BPSK and QPSK, and the redundant bits are
used to determine the errors. A convolutional code has two
parts: the convolutional encoder and the
convolutional decoder. The encoder is described by the
parameters n, k and m. Parameter ‘k’ is the number of input bits,
‘n’ is the number of output bits, and ‘m’ is the memory. Usually
k<n, and ‘m’ must be
large to achieve a low error probability. The values of these
encoder parameters can range from 1 to 8 for k and n, and
from 2 to 10 for m [53, 55]. The decoder is very important
because the performance of a convolutional code is
determined by the decoding algorithm and the distance properties
of the code, and that is why it is very important to determine
the best decoding algorithm. For example, for small values of
k, the Viterbi algorithm is commonly used. The Viterbi decoder is
one of the most widely used decoding algorithms [53].
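
The encoder parameters and the role of the Viterbi decoder can be made concrete with a small example. The sketch below assumes a textbook rate 1/2 code (k = 1, n = 2, m = 2) with generator polynomials 7 and 5 in octal and hard-decision decoding; it is a plain Viterbi decoder for illustration only, not the locked encoder or the NTC-assisted decoder discussed in this paper, and all names are hypothetical.

```python
GENERATORS = (0b111, 0b101)  # assumed generators (7, 5) in octal
N_STATES = 4                 # m = 2 memory bits give 2^2 states


def parity(v):
    return bin(v).count("1") % 2


def step(state, bit):
    """Return (next_state, output_bits) for one input bit."""
    reg = (bit << 2) | state                 # newest bit in the high position
    out = tuple(parity(reg & g) for g in GENERATORS)
    return reg >> 1, out


def conv_encode(bits):
    """Encode the bits and append two tail bits that return to state 0."""
    state, out = 0, []
    for b in list(bits) + [0, 0]:
        state, o = step(state, b)
        out.extend(o)
    return out


def viterbi_decode(received):
    """Hard-decision Viterbi decoding; returns the message without the tail."""
    inf = float("inf")
    metric = [0] + [inf] * (N_STATES - 1)    # the encoder starts in state 0
    paths = [[] for _ in range(N_STATES)]
    for i in range(0, len(received), 2):
        pair = received[i:i + 2]
        new_metric = [inf] * N_STATES
        new_paths = [None] * N_STATES
        for s in range(N_STATES):
            if metric[s] == inf:
                continue
            for b in (0, 1):
                ns, out = step(s, b)
                # Branch metric: Hamming distance to the received pair.
                d = metric[s] + sum(x != y for x, y in zip(out, pair))
                if d < new_metric[ns]:
                    new_metric[ns] = d
                    new_paths[ns] = paths[s] + [b]
        metric, paths = new_metric, new_paths
    return paths[0][:-2]                     # tail bits force final state 0


message = [1, 0, 1, 1, 0, 0, 1, 0]
coded = conv_encode(message)
coded[3] ^= 1                                # two isolated bit errors
coded[10] ^= 1
print(viterbi_decode(coded) == message)
```

The decoder keeps one surviving path per state at every step, so its work grows with the number of states 2^m, which is why large memory values raise the computational complexity mentioned in the text.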

III. METHODOLOGY

Through a combination of literature review, technical
reports, and interviews with selected employees in the data
storage sections of some companies
operating in Tanzania, we assessed the awareness of
data storage techniques and error correction mechanisms. The
primary data for this study were collected from digital data
storage companies in Dar es Salaam, Tanzania. Secondary
data were collected from published papers and industrial
technical reports. In this study, seven companies were
surveyed to identify the kind of storage media they use, the
reliability and failure rate, the main cause of error or failure
of storage devices, the measures taken when an
error or failure occurs, the solutions suggested to
avoid data loss when an error or failure occurs, and the future
plans to avoid errors in the storage media. The collected data
were analyzed using the Statistical Package for the Social
Sciences (SPSS).

IV. RESULTS AND THE PROPOSED MODEL

The reliability and failure behavior of the storage
media in different companies in Dar es Salaam were
analysed. The measures taken when a disk
fails with an error and the suggestions for solving the
problem were collected and analysed. During the
survey, discussions were held with IT professionals working in
data storage departments. It was observed that
most of the data storage media in the existing systems
use Reed Solomon [43, 44, 56]. However, the survey shows
that data loss is still a problem in the industry. Therefore, there
is a need to improve the algorithm to minimize the residual
error during the data reading process from digital storage media. It
was also observed that most users of digital storage media are
not aware of the FEC technologies in the media. In brief, the
following are the observations from the literature review and
the data collected from different companies in Dar es Salaam.

• There is little published scholarly work on the error
and failure patterns of storage media and the key
factors that affect their lifetime. Most of the
information comes from storage manufacturers.

• In spite of having RS codes for Forward Error
Correction in storage media, the industry is still
facing data loss problems, mostly from disk error and
disk failure.

• Most of the error correction in storage media is
done using RS codes, which work best on optical
disk drives.

• RS is suited to correcting burst errors but is not
appropriate for correcting single errors: because
errors are corrected in blocks, it wastes
resources.

• Convolutional codes are more powerful than block
codes; however, they are not preferred as a solution
due to their computational complexity, which delays
the results of data processing.

• The introduction of Locked convolutional codes with
Non Transmittable Codewords (NTC) at the decoding
side can solve the problem of computational
complexity in the Viterbi decoder.

• The majority of people are not aware of, and are not
interested in, improving the Forward Error Correction
codes for storage media; rather, they are interested in
improving backup systems and data recovery
software.

• The hard disk drive is the storage medium
most used to store data.

The following results in Figures 5, 6, 7 and 8 were obtained from
the data collected in seven companies in Dar es Salaam.

Figure 5 displays the reliability of storage media in some
companies in Tanzania. The reliability ranges from a daily to a
weekly basis in most of them. This demonstrates that the
reliability is very low, which is why backup is done on a
daily or weekly basis; very few of them do it yearly.




Figure 5: Reliability of Storage Media for Some Companies in Tanzania

Figure 6 shows the failure rate of storage media in the
surveyed companies. The majority of the companies indicated
that storage media fail within five years. This response is
consistent with the reliability findings: if the reliability is
low, then the failure rate is high. 85% revealed that storage
media fail within not more than five years, and only 14%
indicated more than five years.

Figure 6: Failure Rate of Storage Media for Some Companies in Tanzania

Figure 7 illustrates that most IT professionals in the industry
think that the best way to improve a data storage system is
through the backup system, and few through the data recovery and
restore system. It was revealed that 71.43% prefer to improve
the backup system, while 14.29% prefer to improve data recovery
software and 14.29% prefer to improve other techniques. But 0%
prefer the improvement of Forward Error Correction. The
findings imply that most people are not aware of the
importance of improving these algorithms. However,
improving these algorithms can improve the reliability of
storage media. If deploying a backup system is
very expensive, then having a reliable storage system can help
a lot, especially for those who do not store large volumes of
data. Similarly, individual users will also benefit, as
this issue extends down to the individual level.

Figure 7: How to improve data storage loss
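
Because seven companies were surveyed, each reported percentage corresponds to a whole number of respondents. The sketch below shows that arithmetic; the respondent counts are an inference from the reported percentages and the sample size of seven, not figures stated in the paper, and the names are hypothetical.

```python
N_COMPANIES = 7  # sample size stated in the methodology

# Assumed respondent counts, inferred from the reported percentages.
preferences = {
    "backup system": 5,
    "data recovery software": 1,
    "other techniques": 1,
    "forward error correction": 0,
}


def as_percent(count, total=N_COMPANIES):
    """Express a respondent count as a percentage with two decimals."""
    return round(100 * count / total, 2)


for option, count in preferences.items():
    print(f"{option}: {count}/{N_COMPANIES} = {as_percent(count)}%")
```

Read this way, 14.29% represents a single company, which is worth keeping in mind when interpreting the survey figures.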



Figure 8 demonstrates that in most cases when there is data
loss, the data are very often successfully recovered
through the backup system and data recovery software. But the
analysis revealed that the data are not always all recovered.
Only 14.29% are sure that the data will always be
recovered, and 85.71% remarked that most of the time they can
recover their data, but not always. No respondent (0%) reported
losing everything
when trying to recover lost data, but the question remains for
those who do not have a backup system; thus it goes down to the
individual level.




[Figure 8 legend: very often, no, always]

Figure 8: Data recovery after failure



Proposed Model 

This study proposes an improvement in Forward Error
Correction (FEC) in storage media using a Locked
Convolutional Encoder and the enhanced Viterbi Algorithm
with Non Transmittable Codewords (NTC). During writing, the
source data
will be modulated, then encoded through the locked convolutional
encoder, and finally written to the storage media. During
reading, the data will be decoded through the NTC-Viterbi
decoder, then demodulated, and finally
the data can be retrieved. It is over two decades since Iterative
Detection Read Channel (IDRC) technology was adopted in
hard drive design [57]. This approach delivers unique enhancements
to the signal to noise ratio and improves the
reliability, resiliency, and overall storage capacity of
storage drives. The proposed model adopts the mentioned
IDRC architecture, as shown in Figure 9. This architecture has
three levels of coding, where modulation and locked
convolutional codes are concatenated in the media writing
process. A reverse concatenation is done in the reading
process. A study on the design and evaluation of the proposed
architecture is our next task. Error correction in magnetic storage
started in the 1960s with the application of Fire codes. Later on,
Reed Solomon (RS) took the major role in this area. RS is
currently the principal error correcting code used in
optical disc memories. The use of RS spread when the Compact Disc
(CD) and the CD ROM were introduced at the
beginning of the 1980s [58]. However, data loss due to
storage media failure is still a challenge in the industry,
leaving the door open for researchers to find a more relevant
solution.

Figure 9: Proposed model

V. CONCLUSION

This paper reviewed the challenges facing the industry on
storage media. It has been revealed that data reliability can be
increased by adopting or introducing powerful Forward Error
Correction (FEC) codes in digital storage media. It has been
observed that storage industries are confronted with disasters
of varying degrees; yet people in the industries are not
interested in improving the reliability and fault
tolerance of storage devices, but are interested in improving the
backup and recovery systems, which have cost implications.
Observations on
introducing powerful FEC suggested the use of a new technique
which uses a locked convolutional encoder and an NTC-Viterbi
decoder. The technique, if improved and adapted,
may give better results compared to the current techniques.
Therefore we recommend further study on improving the
new technique so that we can improve the current situation
facing the storage industries.

Abstract — Adoption of the ‘Consumerization of IT’ strategy by
organizations not only saves money and increases business
agility, but also improves employee productivity and satisfaction,
lowers IT procurement and support costs, and improves collaboration.
Organizations have started to develop “Bring Your Own Device”
(BYOD) policies to allow their employees to use their own
devices in the workplace. It is a hard trend that will not only
continue but will accelerate in the coming years. In this paper we
focus on the potential attacks that can strike BYOD when
authenticating to a Wi-Fi network. The paper also enumerates the
authentication protocols and methods. A stringent,
indigenous hybrid authentication scheme for a device that embraces
the BYOD strategy in a Wi-Fi enabled environment is proposed at the
end of the paper.

Keywords — Bring Your Own Device (BYOD), Wi-Fi, 
Authentication, Authorization, Certificate. 

I. Introduction 

Recent years have seen an outburst of consumer mobile
computing devices, whose attractive user friendliness
and falling prices make them a requisite
within easy reach of the common man. Today, with even
greater advances in consumer technology, mobile applications
and the affordability of smart and powerful mobile devices,
organizations are more challenged than ever to incorporate
them into the enterprise IT architecture. The main concerns of
the IT department are the security and data privacy risks that
accompany the BYOD movement, together with increased support
costs.

BYOD holds the promise not only of enabling
companies to become more agile and customer focused, but
also of helping employees swiftly create and apply knowledge at
work, which is key to driving a competitive lead in a
knowledge-driven economy. The challenge today is for the
enterprise to
embrace these changes in ways that improve organizational
effectiveness and productivity while mitigating risks. The
utmost concern with BYOD is the consequences of using
unsecured personal mobile devices for handling corporate
data.

‘Consumerization of IT’ is transforming the traditional IT
landscape of organizations and the way employees use
technology for work purposes. Using their own devices helps
employees to handle the devices efficiently, as they
are more familiar and comfortable with their functionalities,
and also makes it possible for employees to work flexibly
from home or on the road at their convenience.
Also, an employee using his or her own device will take extra
care to safeguard it. Since corporate information and personal
information are on the same device, fetching
information also becomes easier. As a result, communication and
work become faster and more efficient.

According to Gartner’s study published in the Symantec report
2015 [14]:

• 77% of employees use their own phones for work.

• 74% of enterprises allow employees to bring their own
devices to work.

• The number of devices managed in the enterprise
increased by 72% from 2014 to 2015.

Security complements quality. Ignoring the security
aspect has disastrous effects on the organization, leading to loss
of confidential data and of the organization’s reputation. When
employees try to access Wi-Fi through their devices, stringent
authentication is required. Security concerns have held back
Wi-Fi adoption in the corporate world. Hackers and security
consultants have shown how vulnerable the current security
technology, wired equivalent privacy (WEP), used in most Wi-Fi
connections, is. Data is the most crucial asset of any
organization. With the emergence of BYOD it has become more
important to implement a stringent, indigenous, hybrid
authentication mechanism. The rest of the paper is organized as
follows. Section II presents the literature survey for BYOD,
an introduction to the attacks that are pertinent to BYOD, and the
authentication standards, methods and protocols practiced for
the ‘consumerization of IT’. Section III sets out the
attacks, their mechanisms, scenarios, symptoms and
consequences. In Section IV a scheme is proposed for
stringent, indigenous, hybrid authentication for the BYOD strategy
in a Wi-Fi enabled environment.

II. Bring Your Own Device (BYOD) Trend

A. Mobility Strategy in an organization 

The four mobility strategies are Here is Your
Own Device (HYOD), Choose Your Own Device (CYOD), Bring
Your Own Device (BYOD), and On Your Own Device (OYOD).

1) Here is Your Own Device (HYOD)

The devices are provided by the organization. The enterprise
has total control over the device. The organization provides
total support for the device, from installation to
configuration and settings of the device [1].

2) Choose Your Own Device (CYOD) 

The organization provides the employees with a set of
devices, and each employee chooses among them. Policies
are not very stringent for these devices; employees can install
some specific applications and software [1].

3) Bring Your Own Device (BYOD) 

The employee buys a device of his or her choice, pays for it,
and uses it for work at the organization. Policies are thus
weaker and the organization has less control over these devices.
Users can install applications of their choice, provided they
comply with the policies and regulations of the organization
[1].

4) On Your Own Device (OYOD) 

The end user, i.e., the employee can buy and use any device. 
No policies are imposed. 

BYOD can assure security as well as productivity if correct
measures are taken; it is therefore the recent trend seen in the
corporate environment and is expected to grow at a higher
rate.

B. BYOD and its market overview 

Bring your own device (BYOD) is a global phenomenon: 89
percent of IT departments enable BYOD in some form [15]. It
offers advantages, namely enhanced productivity, increased
revenue, reduced mobile costs, and IT efficiencies.

With large numbers of employees already having smart
technology, some organizations view this as an opportunity to
implement new technology without having to pay for the devices
themselves. Some major corporations and organizations choose
to avoid changing their security protocols and migrating to BYOD
because they do not want to risk increased exposure to cyber
threats and data breaches. Another major reason why some
corporations avoid switching to BYOD is that it is still
relatively new and, from a data security point of view, poses
numerous security threats, which could be found in the devices
or even in their apps.

C. BYOD security risks [7] 

1) Unified policy management 

2) Securing and delivering corporate network access and 
services. 

3) Device protection 

4) Secure data transmission 

D. Attacks pertinent to BYOD 



Table No 01: Taxonomy of BYOD attacks [2]

Component  Security attacks (active)           Security attacks (passive)  Privacy
---------  ----------------------------------  --------------------------  -----------------------------------
User       Man in Middle; Social Engineering   Eavesdropping               Data privacy for company and client
Network    SSL attack
Software   Malware; APT
Physical                                       Lost and stolen devices
Web        SQL injection



1) Social Engineering (SE)

Social engineering is the art of manipulating people so that they
give up confidential information.

2) Distributed Denial of Service (DDoS)

A denial of service attack is characterized by an explicit attempt
by an attacker to prevent legitimate users from using
computing resources.

3) Insider Attack

An insider attack is a malicious attack perpetrated on a
network or computer system by a person with authorized
system access.

4) Man in the Middle Attack (MITM)

A man-in-the-middle attack is an attack where the attacker
secretly relays and possibly alters the communication between
two parties who believe they are directly communicating with
each other [2].

5) Secure Socket Layer (SSL) Attack

The Secure Socket Layer attack is a type of attack that focuses
on exploiting the vulnerabilities of the network protocol.
SSL/TLS is the common protocol targeted by the attacker
in this attack [2].

III. Authentication and Authorization

A. Authentication Protocol

Table No 02: Comparative analysis of authentication protocols [29]

Sr No | Authentication | Features | Drawbacks
1 | PAP | Each user who wants to access the network must have a username and password registered with the network access server. PAP is a point-to-point authentication protocol. | PAP is a weak authentication method without any encryption. PAP is vulnerable to password guessing.
2 | CHAP | CHAP is a three-way handshake authentication protocol. Authentication is based on username and password. It can be used with other wireless authentication protocols such as TTLS. It uses a hash value for security. | It is for point-to-point connections and not for mesh networks. CHAP was originally designed for wired networks. It is not as useful for large networks, since every possible secret is maintained at both ends of the link.
3 | Shared key | Shared key is an authentication protocol used in IEEE 802.11. It uses WEP or WPA for encryption and a four-way handshake for authentication. | One key is shared by all users. No mutual authentication. Vulnerable to insider and packet spoofing attacks. When the key is changed, it must be advertised to all users.
4 | EAP-802.1x | It is a port-based authentication. It is a point-to-point authentication protocol. It is for single-hop wireless networks. It is flexible, works with different other protocols and can be implemented with new protocols. It is a data link layer protocol. It provides a way to dynamically send keys to clients. | It cannot support multi-hop networks. It is a framework, not a protocol, as can be seen from its list of protocols. It needs a separate encryption protocol alongside it.
5 | EAP-RADIUS | RADIUS communication takes place on different layers of the OSI protocol stack. It can use different authentication protocols such as EAP-802.1x, EAP-TTLS, EAP-MD5, PAP, CHAP, etc. Four types of messages are exchanged between authentication-related devices. It uses UDP for data transmission. | A network that uses RADIUS has two parts, wired and wireless, and in the wired part data is sent in clear text. It always needs another protocol and separate software for authentication and encryption.
6 | PANA | PANA was developed by an IETF network group. PANA carries the authentication between client and authentication server. PANA works on multi-hop and point-to-point networks. It is very new and still in progress. PANA works at the IP layer. PANA is designed for mutual authentication and fast re-authentication. | The discovery and handshake phase is prone to spoofing attacks by a malicious node, as there is no security relationship between PAA and PaC at that stage. Most router firmware does not include this framework.

B. Extensible Authentication Protocol (EAP) Overview

EAP is not itself an authentication method and does not specify
one; instead, using the framework provided by IEEE 802.1X, it
supports various authentication methods, for example one-time
password, certificate, public key authentication, smart key and
Kerberos. The EAP authentication process can be explained as
follows:

1) The authenticator sends authentication requests to the
supplicant.

2) The user (supplicant) responds to each request.

3) After a sequence of several Request/Response messages,
the authentication ends.

The authentication process is completed with a success or
failure message.
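
The request/response rounds described above can be sketched as a small loop. This is an illustrative toy and not an implementation of any real EAP method: the class names, the two request types and the HMAC-based challenge check are assumptions made for the example.

```python
import hashlib
import hmac
import os


class Supplicant:
    """Hypothetical client that answers each request from the authenticator."""

    def __init__(self, identity: str, secret: bytes):
        self.identity = identity
        self.secret = secret

    def respond(self, request_type: str, payload: bytes) -> bytes:
        if request_type == "identity":
            return self.identity.encode()
        # Challenge request: prove knowledge of the secret without sending it.
        return hmac.new(self.secret, payload, hashlib.sha256).digest()


class Authenticator:
    """Hypothetical authenticator that drives the request/response rounds."""

    def __init__(self, credentials: dict):
        self.credentials = credentials  # identity -> shared secret (assumed store)

    def authenticate(self, supplicant: Supplicant) -> str:
        # Round 1: identity request and response.
        identity = supplicant.respond("identity", b"").decode()
        secret = self.credentials.get(identity)
        if secret is None:
            return "failure"
        # Round 2: challenge request and response.
        challenge = os.urandom(16)
        response = supplicant.respond("challenge", challenge)
        expected = hmac.new(secret, challenge, hashlib.sha256).digest()
        # The exchange always ends with a success or failure message.
        return "success" if hmac.compare_digest(response, expected) else "failure"
```

A real deployment would negotiate the method inside the exchange; here a single challenge round stands in for the "several Request/Response messages".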

Table No 2: Comparison of authentication approaches under EAP [6]

Attribute | EAP-MD5 | EAP-LEAP | EAP-FAST | EAP-TLS | EAP-PEAP | EAP-TTLS
Implementation | Challenge based | Password based | PAC | Certificate based | Server certificate | Server certificate
Authentication attributes | Unilateral | Mutual | Mutual | Mutual | Mutual | Mutual
Deployment difficulties | Easy | Easy | Easy to moderate depending on security | Hard | Moderate | Moderate
Dynamic key delivery | No | Yes | Yes | Yes | Yes | Yes
Server certificate | No | No | Yes for maximum security | Yes | Yes | Yes
Supplicant certificate | No | No | Yes | No | Yes | Yes
Tunneled | No | Yes | Yes | Yes | No | Yes
WAP compatibility | Weak | Moderate | Weak to secure depending on implementation | Maximum security | Secure | Secure
WLAN security | Weak | Moderate | Weak to secure depending on implementation | Maximum security | Secure | Secure
Vulnerabilities | Identity exposed, dictionary attack, MITM attack | Identity exposed, dictionary attack | Maximum security is comparable to PEAP and TTLS | Identity exposed | MITM attack; identity hidden in phase 2 but potential exposure in phase 1 | MITM


C. Authentication Methods 

1) Two factor authentication

Two-factor authentication is a security process in which the 
user provides two means of identification from separate 
categories of credentials; one is typically a physical token, such 
as a card, and the other is typically something memorized, such 
as a security code. 

2) Multi factor authentication (MFA) 

Multifactor authentication is a security system that requires 
more than one method of authentication from independent 
categories of credentials to verify the user’s identity for a login 
or other transaction. 

3) Single Sign-on (SSO) 

Single sign-on is a session/user authentication process that 
permits a user to enter one name and password in order to 
access multiple applications. 

4) Public Key Infrastructure (PKI) 

A public key infrastructure supports the distribution and
identification of public encryption keys, enabling users and
computers both to securely exchange data over networks such
as the Internet and to verify the identity of the other party.

5) Digital Certificate

Digital certificates are the basis for device authentication.
Passwords are inherently vulnerable to phishing attacks,
whereas user certificates are not [18].

a) General Certificate: 

Digital Certificates are a means by which consumers and 
businesses can utilize the security applications of PKI. The 
certificate contains the name of the certificate holder, a serial 
number, expiration dates, a copy of the certificate holder's 
public key and the digital signature of the certificate-issuing 
authority (CA) so that a recipient can verify that the certificate 
is real. 

b) Attribute Certificate:

An Attribute Certificate (AC) is a set of properties, issued by
an Attribute Authority (AA), that relates to its owner and
defines the owner's permissions in systems. The AC builds on the
general (public key infrastructure) certificate, which prevents
an attacker from forging it. The AC format is defined by X.509
version 3. Its use involves two main operations: registering a
certificate for an employee, and loading the certificate when
the employee requests authentication, thus verifying and
validating the device.

6) One-time password 

An OTP is more secure than a static password, 
especially a user-created password, which is typically weak. 
OTPs may replace authentication login information or may be 
used in addition to it, to add another layer of security. 
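
As an illustration of how a one-time password can be derived, the sketch below follows the counter-based HOTP construction of RFC 4226. The text does not say which OTP scheme is meant, so the choice of HOTP and the helper name `verify_otp` are assumptions made for the example.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Counter-based one-time password (HOTP, RFC 4226)."""
    # HMAC-SHA1 over the 8-byte big-endian counter.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects 4 bytes.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_otp(secret: bytes, counter: int, submitted: str) -> bool:
    """Hypothetical server-side check for a single counter value."""
    return hmac.compare_digest(hotp(secret, counter), submitted)
```

Because each counter value yields a different code, a captured password cannot be replayed once the server has advanced its counter.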

7) Smart key 

Smart cards can provide personal identification, 
authentication, data storage, and application processing. Smart 
cards may provide strong security authentication for single 
sign-on (SSO) within large organizations. 

8) Kerberos 

Kerberos is a computer network authentication protocol that
works on the basis of 'tickets' to allow nodes communicating
over a non-secure network to prove their identity to one another
in a secure manner. It is aimed primarily at a client-server
model and provides mutual authentication: both the user and the
server verify each other's identity. Kerberos protocol messages
are protected against eavesdropping and replay attacks. The
disadvantages of the Kerberos protocol are:

a) Single point of failure

b) The administration protocol is not standardized and
differs between server implementations

c) If an organization's infrastructure needs to grow across
highly distributed networks, Kerberos does not scale well
enough to be deployed.

9) Secure Socket Layer 

The Secure Sockets Layer (SSL) protocol runs on top of TCP/IP
and provides security for client and server communication. It
uses asymmetric encryption to transfer information securely
between both sides; this ensures the confidentiality and
integrity of the information and also makes it possible to
verify the identity of the parties in the conversation.

During the SSL handshake, each side asks for the other's
certificate for identity authentication, which improves the
safety of the connection, but some drawbacks remain:

1) It cannot provide access control functions

2) Different users who connect to the same server all receive
the same authorization, which is bad practice.

3) It can only provide one-to-one SSL connections

4) A multi-level, multi-certificate chain of trust
relationship cannot be achieved.

D. Mobile Device Management (MDM) 

Mobile Device Management is a tool that centrally controls
devices and can remotely configure, over the air, those devices
that are connected to the network [5]. The MDM authenticates
devices by exchanging certificates from the organization's
server, which define access rights. MDM continuously syncs and
stores backups of the data. All communication is generally
secured by SSL/TLS to provide an encrypted channel.

E. Device Fingerprinting 

Device fingerprinting allows devices to be identified, or 
fingerprinted, as an additional means of authentication. 



F. Device Encryption 

Encryption is a cornerstone of BYOD security. Encrypted
VPNs using IPsec provide for the confidentiality and integrity
of data in transit, but this may leave data on the devices
unencrypted.



IV. Proposed Approach 

To avoid the limitations of the SSL protocol, we propose a hybrid
authentication system that combines an attribute certificate
with the general certificate to authenticate the device and the
service, and multi-factor authentication to verify the user.
When a client requests authentication, it is validated on the
basis of a general certificate and an attribute certificate.
This makes it possible to achieve access control along with a
multi-level, multi-certificate chain of trust relationship.

In order to mitigate DDOS attacks, we propose a three-tier
captcha used together with username and password. The three-tier
captcha is implemented to mitigate and resist attacks such as
dictionary attacks, pixel count attacks, pre-processing and
vertical segmentation [5].

The advantages of a three-tier captcha are:

• Enhances security

• Is easy to use

• Prevents automated attacks

• Makes the pattern difficult to identify

The basic flow of the authentication system is as follows:




Fig 02: Flowchart for the authentication mechanism 

The authentication mechanism uses Oracle 10g as its database
for storing the employees' credentials. The user's personal
information is stored in the UserInfo table of the database.
When the user tries to connect to the AP of the organization's
Wi-Fi, the MalwareScanner runs a malware scan over the device to
check for signs of malicious activity.

1) MalwareScanner:

The malware scanner checks for any malicious applications or
services on the device before it connects to the organization's
Wi-Fi.

2) VerifyUser:

When the user logs in, the server first verifies the
username/password.

3) InputThreeTierCaptcha:

The user needs to enter the captcha based on the query displayed
on screen. The server verifies the input; this helps mitigate
DDOS attacks.

Algorithm of three-tier captcha

1) Create a random string that combines letters, numbers
and symbols

2) Create an image of it with some noise

3) Generate a random query related to the code

4) Put the image and query onto the user interface

5) Allow the user to provide the input

6) Compare the user's input with the value stored in the session

7) If the input is correct, allow the user to proceed

8) If the input is incorrect, generate another captcha image
and allow a limited number of attempts, say 3. If the user does
not provide the correct username, password and captcha answer
within three attempts, a text message is sent to the employee's
number/email account and an entry is made in the server logs.
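
The algorithm above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: image rendering (steps 2 and 4) is omitted, the "query related to the code" is assumed to ask for only the letters, digits or symbols of the code, and the `notify` callback is a hypothetical stand-in for the text message/email and the server log entry.

```python
import secrets
import string

SYMBOLS = "@#$%&"   # assumed symbol set
MAX_ATTEMPTS = 3    # "say 3" in step 8

# Assumed query types: each selects one character class from the code.
QUERIES = {
    "letters": str.isalpha,
    "digits": str.isdigit,
    "symbols": lambda ch: ch in SYMBOLS,
}


def new_challenge(length: int = 9) -> dict:
    """Steps 1 and 3: a random code plus a random query about it."""
    pools = [string.ascii_letters, string.digits, SYMBOLS]
    # Guarantee at least one character of each class, then shuffle.
    chars = [secrets.choice(pool) for pool in pools]
    chars += [secrets.choice("".join(pools)) for _ in range(length - len(pools))]
    secrets.SystemRandom().shuffle(chars)
    code = "".join(chars)
    kind = secrets.choice(list(QUERIES))
    answer = "".join(ch for ch in code if QUERIES[kind](ch))
    return {"code": code, "query": f"Type only the {kind} shown", "answer": answer}


class CaptchaSession:
    """Steps 5 to 8: check input against the session value, limit attempts."""

    def __init__(self, notify=print):
        self.notify = notify
        self.failures = 0
        self.locked = False
        self.challenge = new_challenge()

    def submit(self, user_input: str) -> bool:
        if self.locked:
            return False
        if user_input == self.challenge["answer"]:
            return True
        self.failures += 1
        if self.failures >= MAX_ATTEMPTS:
            self.locked = True
            self.notify("three failed captcha attempts")
        else:
            self.challenge = new_challenge()  # fresh captcha after a wrong answer
        return False
```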

After verifying the username/password, the authentication
system queries the user's certificate information in the
back-end database. If the user has already applied for a
digital certificate, the system automatically jumps to the
certificate validation page.

Otherwise, the system automatically jumps to the certificate
request page and guides the user through the installation of
digital certificates.

4) InstallCertificates:

Installing certificates enables device fingerprinting and the
distribution of Wi-Fi settings to the employee's device, which
allows the employee to access the Wi-Fi and the resources
behind it.

5) BindCertificates:

When the user has successfully applied for digital certificates,
the system binds the new digital certificates to the user's
personal information in the database. The workflow for binding
the user's personal information to the digital certificates is
as follows:

a) Download the BindKey control to the local Web browser.

b) Call the function GetPK of the control to extract the public
key of the digital certificate, and send the public key to the
server.

c) After receiving the public key, the server enters the public
key information into the corresponding record of the UserInfo
table, according to the username of the logged-in user. The
binding of personal information and the digital certificates is
then complete.

6) VerifyCertificate:

After the personal information has been bound to the digital
certificates, or when the user logs in again, the
authentication system verifies the user's certificate and sends
the key attribute values of the attribute certificate to the AA.
The AA verifies the authenticity of the attribute certificate,
determines the client's role authority, and returns the result
to the server. The steps are:

a) The server asks the user for certificate validation and
attribute certificate validation at the same time.

b) After the user receives the request, it sends its public key,
certificate and attribute certificate to the server.

c) The server verifies the certificate after receiving it, then
sends the key attribute values of the attribute certificate to
the AA. The AA verifies the authenticity of the attribute
certificate, determines the user's role authority, and
returns the result to the server.

d) After the user and the server have verified each other,
the handshake continues with the remaining steps.

e) When the handshake process is over, the user can communicate
with the server in the role that the attribute certificate
approves, and the server can ensure the security of the
system by granting the corresponding authority according to
the user's role.
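
The role check at the end of this flow can be sketched as below. This is a simplified illustration, not the proposed system itself: an HMAC stands in for the AA's real X.509 signature, and the class names, the `ROLE_PERMISSIONS` table and the resource names are all hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeCertificate:
    holder: str       # identifies the holder's general certificate
    role: str         # role attribute granted by the AA
    signature: bytes  # stand-in for the AA's signature


class AttributeAuthority:
    """Hypothetical AA; an HMAC key replaces a real signing key pair."""

    def __init__(self, key: bytes):
        self._key = key

    def _sign(self, holder: str, role: str) -> bytes:
        return hmac.new(self._key, f"{holder}|{role}".encode(), hashlib.sha256).digest()

    def issue(self, holder: str, role: str) -> AttributeCertificate:
        return AttributeCertificate(holder, role, self._sign(holder, role))

    def verify(self, ac: AttributeCertificate):
        """Return the role if the attribute certificate is genuine, else None."""
        genuine = hmac.compare_digest(ac.signature, self._sign(ac.holder, ac.role))
        return ac.role if genuine else None


# Hypothetical mapping from role to the resources it may access.
ROLE_PERMISSIONS = {
    "employee": {"wifi", "mail"},
    "admin": {"wifi", "mail", "hr-db"},
}


def authorize(aa: AttributeAuthority, certificate_valid: bool,
              ac: AttributeCertificate, resource: str) -> bool:
    """Server side: general certificate first, then the AA decides the role."""
    if not certificate_valid:
        return False
    role = aa.verify(ac)
    return role is not None and resource in ROLE_PERMISSIONS.get(role, set())
```

Keeping the role decision in the AA means the server only maps a verified role to permissions, which is what gives the scheme access control on top of the plain handshake.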

The attribute certificate added to the SSL protocol handshake
process is shown in the figure below:

[Figure: message sequence between CLIENT, SERVER and ATTRIBUTE AUTHORITY (AA).
Client to server: CLIENT HELLO (supported encryption algorithms, random number).
Server to client: SERVER HELLO, CERTIFICATE, SERVER KEY EXCHANGE, CERTIFICATE REQUEST, SERVER HELLO DONE (selected encryption algorithm, random number, certificate).
Client to server: CERTIFICATE, CLIENT KEY EXCHANGE, SEND ATTRIBUTE CERTIFICATE.
Server to AA: attribute certificate key values (ATTRIBUTE CERTIFICATE VERIFY).
AA to server: validation result (GET CLIENT ROLE).
Client to server: [CHANGE CIPHER SPEC], FINISHED (handshake information, MAC value).
Server to client: [CHANGE CIPHER SPEC], FINISHED (handshake information, MAC value).
Both directions: APPLICATION DATA.]

Fig 02: Attribute certificate mechanism [4] 

7) DataDecrypter:

Once the device and user are authenticated and authorized for
the given service, the confidential data is decrypted.

8) DataEncrypter:

When the employee disconnects from the organizational
Wi-Fi, the data classified as confidential and critical is
encrypted.

Conclusion 

A hybrid authentication system is required to ensure
productivity and efficiency when adopting the BYOD trend. A
malware scanner scans the device when it tries to connect to the
organization's Wi-Fi. A three-tier captcha, combined with
username and password, authenticates and verifies the user.
The attribute certificate mechanism in SSL enables a multi-level
trust relationship. An attribute certificate has a certificate
signer unique identifier and extensions, and the certificate
signer's certificate can be added in the extensions. When a
user's own certificate cannot be verified by the other side, the
extensions of the attribute certificate can be used to request
the certificate of the certificate signer, which achieves the
goal of multi-level authentication.

Abstract — Cognitive radio (CR) technology is used in wireless
networks with the aim of handling wireless spectrum scarcity.
On the other hand, using CR in wireless networks introduces new
challenges and performance overheads that should be resolved. It
is crucial to study the effects of the unique characteristics of
cognitive radio on the various protocols. In this paper, a
simulation-based performance evaluation is presented in terms
of the MAC delay of a CR user. The effects of the spectrum
sensing parameters (sensing duration and frequency), the primary
users' activity and the number of spectrum channels on the delay
of the MAC layer are investigated. The growth and decay rates of
MAC delay are studied in more detail through various NS2-based
simulations. The results give some useful insights for
formulating the MAC delay based on the CR-specific parameters.

Keywords-Cognitive radio; MAC; Delay; Primary user; 
Spectrum sensing 

I. Introduction 

Cognitive radio (CR) technology has been widely used in wireless
networks in order to use the wireless spectrum dynamically [1].
Cognitive radio enables the dynamic spectrum access (DSA)
approach in wireless networks [2]. Wireless networks with
CR-enabled nodes are generally called cognitive radio networks
(CRNs) [3]. A CR node operates based on a cognitive radio cycle
[4]. The cognitive radio cycle is composed of four main stages:
spectrum sensing, spectrum decision, spectrum sharing and
spectrum hand-off [5]. A CR node is allowed to communicate on a
wireless channel in the absence of the primary user (PU), who
has priority to use the spectrum channels [6]. The unique
characteristics of CRNs, i.e., periodic spectrum sensing and
operating in the absence of the primary user, create new
challenges in the modeling and performance evaluation of various
protocols in the MAC, network and transport layers. Evaluating
the performance of different protocols in CRNs is useful for
investigating the overhead of using CR technology in wireless
networks. Awareness of the CR overhead in wireless networks can
in turn lead to the design of efficient protocols for CRNs.

There is a large body of research in the literature on the
performance evaluation of various protocols in CRNs.
Researchers have evaluated the performance of CRNs in terms of
diverse factors such as throughput, packet loss, delay and
jitter.

In [18], the authors model and evaluate the sending rate
distribution of source nodes in the transport layer of CRNs. The
work in [10] establishes a balance between packet delay and
sensing time by proposing a MAC protocol. In [15], an admission
control is proposed in order to support delay-sensitive
communications of CR users in cognitive radio networks.
The authors of [9] propose an admission control with the aim of
minimizing the end-to-end delay and jitter in CRNs. Expected
packet delay in CRNs is modeled based on waiting and
transmission delays in [17]. The authors of [21] propose two
optimal scheduling methods in order to maximize the
throughput and minimize the scheduling delay in CRNs. The
stochastic delay and backlog bounds of the transport layer in
CRNs are modeled in [16]. In [13], the optimality of congestion
control schemes in cognitive radio networks is investigated in
order to minimize the congestion probability and the delay of
the transport layer. The authors of [20] investigate the
challenges of delay-sensitive data transport based on the unique
features of cognitive radio. In [14], [11] and [19], the effect
of dynamic spectrum access on the throughput of the transport
layer is studied in cognitive radio networks. The impact of CRN
characteristics on the performance of routing and transport
protocols is evaluated based on simulations in [12]. The packet
loss probability, end-to-end delay and throughput are evaluated
based on the cognitive radio parameters in [22].

Although there are several studies on the performance
evaluation of CRNs, the delay behavior of the MAC layer for an
individual CR node still needs to be studied. Investigating the
behavior of MAC delay for a CR node can give us appropriate
insight to

• design efficient MAC protocols for CRNs and

• adjust the existing MAC protocols and CR-related
parameters

with the aim of improving the MAC delay performance of
CRNs.

In this paper, we present a simulation-based study of the
MAC delay behavior of a CR node. The impact of spectrum
sensing, primary user activity and the number of wireless
channels on the MAC delay is investigated through various
simulations. The rate of increase or decrease of MAC delay is
studied and compared as the various parameters change. The
results of this paper on the behavior of MAC delay can be
helpful for formulating the MAC delay with regard to the unique
characteristics of CR. They can also provide useful insights for
determining appropriate values of the sensing duration, the
sensing frequency and the number of wireless channels in CRNs
with the aim of minimizing the MAC delay.

In the rest of the paper, Section II describes the system
model. Section III studies the performance of MAC delay
based on the sensing duration and frequency (Section III-A),
the PU activity (Section III-B) and the number of wireless
channels (Section III-C). Finally, Section IV concludes the
paper.

II. System Model 

A. Spectrum Sensing 

A CR node needs to sense the wireless spectrum for a
predefined duration (sensing time) and check for the presence of
the primary user (PU), who holds the license to use the spectrum
channels. If there is no PU on a wireless channel, the CR node
is allowed to communicate on the free wireless channel for a
specific time (operating time). Since a CR node cannot sense
the spectrum and send data on it simultaneously, spectrum
sensing is done periodically, with the aim of minimizing the
interference between the communication of CR nodes and the PU
[7]. Sensing is assumed to be ideal, without any errors in
detecting PU presence.

Let $t_s$ and $f_s$ be the spectrum sensing duration and
frequency, respectively. In other words, a CR node senses a
spectrum channel for $t_s$ with a period of $1/f_s$. Each period
is composed of two durations. The first one is named the
spectrum sensing duration and the second one is called the
operating duration. Let $t_o$ be the operating duration of a CR
node in each period, which is equal to $t_o = \frac{1}{f_s} - t_s$.
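
As a small worked example of this relation, the helpers below convert between sensing frequency and operating duration, assuming each period of length $1/f_s$ splits exactly into a sensing part $t_s$ and an operating part $t_o$. The function names are illustrative only.

```python
def operating_duration(t_s: float, f_s: float) -> float:
    """Operating duration per period: t_o = 1/f_s - t_s."""
    period = 1.0 / f_s
    if t_s > period:
        raise ValueError("sensing duration cannot exceed the sensing period")
    return period - t_s


def sensing_frequency(t_s: float, t_o: float) -> float:
    """Inverse relation: f_s = 1 / (t_s + t_o)."""
    return 1.0 / (t_s + t_o)
```

For instance, a sensing duration of 0.01 sec and an operating duration of 0.6 sec correspond to a period of 0.61 sec, i.e. a sensing frequency of about 1.64 per second.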

B. Primary User Activity 

Primary users have higher priority to use the wireless
spectrum channels. Modeling of PU activity has a high degree
of importance because of its impact on the communications of
CR users. The most common model for PU activity is the
two-state birth/death Markovian process [8] with birth rate
$\beta$ and death rate $\alpha$ [7]. In CRNs, the two states of
the Markovian process are named the ON and OFF states. In the ON
state, the wireless channel is occupied by the primary user (the
PU is ON (active) on the channel). In the OFF state, the PU is
not active on the channel. The birth rate ($\beta$) and death
rate ($\alpha$) are called the entrance and departure rates of
primary users, respectively.
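
The ON/OFF model can be illustrated with a short simulation. This is a sketch under the assumption that OFF periods are exponentially distributed with rate $\beta$ (entrance) and ON periods with rate $\alpha$ (departure), as in a two-state continuous-time Markov process; the function name and the seed are arbitrary choices for the example.

```python
import random


def simulate_pu_on_fraction(alpha: float, beta: float, cycles: int, seed: int = 0) -> float:
    """Fraction of time the PU is ON over a number of OFF/ON cycles."""
    rng = random.Random(seed)
    time_on = 0.0
    time_off = 0.0
    for _ in range(cycles):
        time_off += rng.expovariate(beta)   # OFF period ends when a PU enters
        time_on += rng.expovariate(alpha)   # ON period ends when the PU departs
    return time_on / (time_on + time_off)
```

For such a process the long-run ON fraction approaches $\beta / (\alpha + \beta)$, so with $\alpha = 3$ and $\beta = 1$ the simulated value should settle near 0.25.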

C. Wireless Channels 

Let $N$ be the number of wireless channels. For each
channel, there is a licensed primary user that enters the channel
with the mean entrance rate of $\beta$ and leaves it with the
mean departure rate of $\alpha$.



III. Performance Evaluation of MAC Delay 

In this section, the MAC delay is investigated based on
various parameters, i.e., spectrum sensing duration and period,
the number of wireless channels and the entrance and
departure rates of primary users. Our simulations are based on
NS2 [23] and CogNS [22], which is an NS2-based simulation
framework for cognitive radio networks. The default simulation
setup is given in Table I. The default number of wireless
channels is 1. The sensing and operating durations are set to
0.01 sec and 0.6 sec, respectively. The activity parameters of
primary users are set to $(\alpha, \beta) = (3, 1)$. The packet
size is 512 bytes. The duration of each simulation is 500 sec.

TABLE I. Default Simulation Settings

Parameter | Default Value
The number of wireless channels ($N$) | 1
Spectrum sensing duration ($t_s$) | 0.01 sec
Operating duration ($t_o$) | 0.6 sec
PU departure rate ($\alpha$) | 3
PU entrance rate ($\beta$) | 1
Packet size | 512 bytes
Simulation time | 500 sec


A. Spectrum sensing duration and frequency 

Spectrum sensing duration and frequency are two basic
factors of cognitive radio that have significant influence on the
performance of CRNs and the quality of service (QoS) of
primary users. Based on the sensing duration ($t_s$) and the
operating duration ($t_o$), sensing efficiency can be defined
as follows [7]:

$$\varepsilon = \frac{t_o}{t_s + t_o} \qquad (1)$$

Maximizing the sensing efficiency increases the spectrum
utilization of CR users. Therefore, maximizing sensing
efficiency while reducing the interference on primary users is
an objective in CRNs. The value of $\varepsilon$ tends to 1 when
either $t_s$ tends to zero or $t_o$ tends to $\infty$. On the
other hand, small values of $t_s$ lead to a low PU detection
probability, which increases the interference on primary users.
Also, decreasing $t_s$ increases the false alarm probability,
which reduces the spectrum utilization of CR users. From the
point of view of CR users, increasing the value of $t_o$ raises
the sensing efficiency. However, a high value of $t_o$ can
increase the interference on PUs. Therefore, the selection of
optimal sensing and operating durations is a crucial task in
CRNs [7].
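
The trade-off can be made concrete with a short calculation. The sketch assumes that sensing efficiency is the fraction of each period spent operating, $t_o / (t_s + t_o)$, which matches the limits described above (it tends to 1 as $t_s \to 0$ or $t_o \to \infty$); the function name is illustrative.

```python
def sensing_efficiency(t_s: float, t_o: float) -> float:
    """Assumed definition: share of each period available for transmission."""
    return t_o / (t_s + t_o)


# Default settings from Table I: t_s = 0.01 sec, t_o = 0.6 sec.
default_efficiency = sensing_efficiency(0.01, 0.6)

# Longer sensing lowers the efficiency; longer operation raises it.
longer_sensing = sensing_efficiency(0.08, 0.6)
longer_operating = sensing_efficiency(0.01, 4.0)
```

With the default settings the efficiency is roughly 0.98, so nearly the whole period is available for transmission.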

Figure 1. The behavior of MAC delay based on the changes of spectrum
sensing duration

Figure 2. The behavior of MAC delay based on the changes of operating
duration


Studying the impact of the sensing and operating durations on
the various factors in CRNs can help us find appropriate
values of $t_s$ and $t_o$. In this paper, we focus on the delay
of the MAC layer in a CR node. Generally, we expect that
decreasing $t_s$ and increasing $t_o$ reduce the MAC delay
of a CR node. On the other hand, small values of $t_s$ and
large values of $t_o$ can violate the main objectives of
cognitive radio networks.

We investigate the behavior of MAC delay for different
values of $t_s$ and different numbers of wireless channels. The
value of $t_s$ varies from 0.01 second to 2 seconds. The MAC
delay is calculated for each value of $t_s$ with the number of
channels ($N$) set to 1 and 6. All other parameters are set
as given in Table I. The results are depicted in Fig. 1. As
seen in Fig. 1, increasing the value of $t_s$ raises the value
of MAC delay. In this figure, the solid and dashed curves
represent the MAC delay values for $N=1$ and $N=6$, respectively.

According to Fig. 1, the MAC delay increases linearly with
$t_s$ regardless of the number of channels. The slope of the
curves changes around $t_s = 0.08$ second for both $N=1$ and
$N=6$. The growth rate of MAC delay (the slope of the curve)
between $t_s = 0.01$ sec and $t_s = 0.08$ sec is greater than
the growth rate between $t_s = 0.08$ sec and $t_s = 2.0$ sec,
regardless of the number of wireless channels. As seen in
Fig. 1, the growth rate of MAC delay with $t_s$ is different for
$N=1$ and $N=6$. For $N=6$, the slope of the curve is smaller.
As a consequence, changing the sensing duration has more
impact on MAC delay when there are fewer wireless channels.

The behavior of MAC delay for values of $t_o$ from
0.01 to 4 seconds is depicted in Fig. 2. The MAC delay is
calculated with the number of channels ($N$) set to 1 and 6. All
other parameters are set to the values in Table I.
As shown in Fig. 2, increasing the value of $t_o$ reduces the
value of MAC delay. The solid and dashed curves represent the
MAC delay values for $N=1$ and $N=6$, respectively.

According to Fig. 2, a non-linear decrease of MAC delay with
$t_o$ is observed for different numbers of wireless channels.
The decay rate of MAC delay with $t_o$ differs with the
number of channels: the decay rate for $N=6$ is
greater than the decay rate for $N=1$. As a consequence,
changing the operating duration has more impact on MAC
delay when there are more wireless channels.

B. Entrance and departure rate of PU 

The activity of primary users on wireless channels has a
great impact on the performance of CR users. Based on the
activity model of primary users, the probability of the presence
of a PU on a channel is as follows [7]:

$$P_{on} = \frac{\beta}{\alpha + \beta} \qquad (2)$$

where $\alpha$ and $\beta$ are the mean departure and entrance
rates of primary users, respectively. The probability that a CR
user can find a channel free of primary users can be calculated
as follows:

$$P_{free} = 1 - (P_{on})^N \qquad (3)$$

where $N$ is the number of wireless channels. Generally, we
expect that large values of $\beta$ and small values of
$\alpha$ increase the MAC delay of a CR user.
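
These probabilities are easy to evaluate numerically. The sketch below assumes the standard stationary ON probability of the two-state model, $P_{on} = \beta / (\alpha + \beta)$, together with equation (3); the function names are illustrative.

```python
def p_on(alpha: float, beta: float) -> float:
    """Assumed stationary probability that a PU occupies a channel."""
    return beta / (alpha + beta)


def p_free(alpha: float, beta: float, n_channels: int) -> float:
    """Equation (3): probability that at least one of N channels is free."""
    return 1.0 - p_on(alpha, beta) ** n_channels


# Default PU activity from Table I: (alpha, beta) = (3, 1).
single_channel = p_free(3.0, 1.0, 1)
six_channels = p_free(3.0, 1.0, 6)
```

With the default rates a single channel is free with probability 0.75, while with six channels the probability of finding a free channel is above 0.999.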

In order to investigate the effect of $\alpha$ on MAC delay in
more detail, we study the behavior of MAC delay for
different values of $\alpha$ and for various values of $\beta$.
We change the value of $\alpha$ from 1 to 6 and study the
behavior of MAC delay. This simulation is repeated for three
different values of $\beta$ = 1, 2 and 3. All other parameters
are set as given in Table I. The results are depicted in Fig. 3.
As can be seen in Fig. 3, increasing $\alpha$ reduces the MAC
delay.

According to Fig. 3, a non-linear decay of MAC delay with
$\alpha$ can be seen for all values of $\beta$. The decay rate
of MAC delay with $\alpha$ differs for different values of
$\beta$. For larger values of $\beta$, the rate of decay is
larger. As a consequence, changing $\alpha$ has more effect on
the MAC delay when the mean entrance rate of primary users is
greater.

To study the impact of β on the MAC delay in more
detail, we consider the MAC delay versus β for several
values of α. We vary β from 1 to 5 and investigate the behavior
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Figure 3. The behavior of MAC delay based on the changes of mean 
departure rate of PU 




Figure 4. The behavior of MAC delay based on the changes of mean 
entrance rate of PU 




Figure 5. The behavior of MAC delay based on the changes of number of 
wireless channels 

of MAC delay. This simulation is repeated for three different
values of α = 1, 2 and 3. All other parameters are set
as given in Table I. The results are depicted in Fig. 4. As
can be seen in Fig. 4, increasing β raises the MAC
delay.



According to Fig. 4, the MAC delay grows non-linearly
with β for all values of α. The growth rate of the MAC
delay with β differs for the various values of α: for
smaller values of α, the growth rate is larger. As a result,
changing β has more impact on the MAC delay when the
mean departure rate of PUs is smaller.

C. The Number of Wireless Channels 

The number of wireless channels has an unavoidable effect on
the MAC delay of CR users. A larger number of wireless
channels increases the chance that a CR user finds a channel
free of primary users because, according to (3), a larger
value of N decreases the term $(P_{on})^N$ and,
consequently, the value of $P_{free}$ increases.
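As a worked illustration of (3), assume a hypothetical occupancy probability $P_{on} = 0.5$ (a value chosen only for illustration, not taken from Table I):

```latex
\begin{align*}
N = 1:\quad  & P_{free} = 1 - 0.5^{1}  = 0.5 \\
N = 6:\quad  & P_{free} = 1 - 0.5^{6}  = 0.984375 \\
N = 11:\quad & P_{free} = 1 - 0.5^{11} \approx 0.999512
\end{align*}
```

Each added channel contributes less than the previous one, so the benefit of increasing N saturates rather than growing linearly.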

To study the impact of the number of wireless channels on
the MAC delay of a CR user in more detail, we evaluate the
MAC delay for values of N from 1 to 11 under different
primary-user activities, i.e., (α, β) = (1, 3), (α, β) = (1, 1),
and (α, β) = (3, 1). The corresponding curves are shown in
Fig. 5. As seen in this figure, the MAC delay decreases
non-linearly as N increases for all considered primary-user
activities. The decay rate is larger for more active primary
users.

IV. Conclusion 

The behavior of the MAC layer of a CR node with regard to the
sensing duration, the operating duration, the activity of primary
users and the number of wireless channels was investigated
through various simulations. The MAC delay increases linearly
with the sensing duration. Increasing the operating duration
decreases the MAC delay non-linearly. Reducing the mean
entrance rate of primary users leads to a sharp, non-linear
decline in the MAC delay. Increasing the mean departure rate
of primary users also reduces the MAC delay. However, the
decay of the MAC delay caused by reducing the PUs’ entrance
rate is greater than that caused by increasing their
departure rate. Moreover, the MAC delay rises non-linearly
as the number of wireless channels decreases.
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Abstract — Cooperative diversity is becoming a potential 
solution for future wireless communication networks due to its 
capability to form virtual antenna arrays for each node (i.e. 
user). In cooperative networks, the nodes are able to relay the 
information between the source and the desired destination. 
However, the performance of such networks (for instance,
mobile networks, ad-hoc networks and vehicular networks) is
generally affected by the mobility of the nodes. As the nodes’
mobility increases, the networks are subjected to
frequency offset and unknown channel properties of the
communication links, which degrade the system’s
performance. In a practical scenario, it is challenging
and impractical for the relay and destination to estimate the
frequency offset and channel coefficients, especially in a
time-varying environment. In this manuscript, a comprehensive
survey of the existing literature on receiver designs based on
Double Differential (DD) transmission and the Multiple Symbol
Detection (MSD) approach is presented; these designs eliminate
the need for complex channel and frequency offset estimation.

Index Terms — Cooperative Diversity, Double Differential, 
Frequency Offset, Multiple Symbol Differential Detection. 



I. Introduction 

In recent years, wireless cooperative diversity has gained
significant attention because of its capability to achieve a
low Bit Error Rate (BER), high network throughput, high
data transmission reliability and spectral efficiency, which
support the rapidly growing demand for wireless
communications. By exploiting the broadcast nature and
diversity gain of wireless communication, cooperation
between users can be realized without physical antenna
arrays being installed at the transmitting
and receiving nodes [1,2]. The idea behind this technique is
to enable the source to broadcast signals over
independent wireless paths towards its destination with the
help of other node(s) that act as relay(s) in transmission
schemes such as the One-Way Relay Network (OWRN) or the
Two-Way Relay Network (TWRN) in [3], as depicted in Fig. 1.
For OWRN, the transmission is divided into the
broadcast phase and the relayed phase. In the broadcast phase,
the source broadcasts its information to the relay and directly
towards the destination.
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The source then remains idle during the second (i.e. relayed) 
phase. Meanwhile, the relay processes the received
signals and relays the processed signals to the intended
destination. In TWRN, when two nodes communicate
with each other through a relay, the relay processes the
superimposed (summed) signals received from both
nodes and broadcasts the result back to both nodes [3,4].



Figure 1: (a) One-Way Relay Network (OWRN) and
(b) Two-Way Relay Network (TWRN), with the Phase 1 and
Phase 2 transmissions marked [3]



The received signal at the relay is processed according to
one of several relaying protocols, such as Decode-and-Forward
(DF) and Amplify-and-Forward (AF) [2,5]. In the DF
protocol, the relay decodes the source information and
re-encodes the signals before retransmitting the information to
the desired destination. However, the DF protocol may
suffer from error propagation: when the relay decodes
incorrectly, it transmits an erroneous signal, which
deteriorates the whole system. In contrast, in the AF protocol,
the relay receives the source information and simply amplifies
the signals (both information and noise) by a certain
multiplication factor before retransmitting the scaled
version of the signals towards the destination. The AF
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protocol is preferred over DF protocol due to its simplicity 
and, importantly, its suitability for low-power
networks, as it avoids error propagation [6].

At the destination, all the relayed signals and the directly
transmitted information are detected and combined using
various diversity combining techniques in order to improve
the performance of the faded received signals. Most of the
existing literature assumes that perfect knowledge of the
channel coefficients is available at the relay and
destination. In practical applications, however, perfect
channel estimation is difficult because of the variation of
the channel, especially in a fast
fading environment. Hence, differential transmission
schemes are studied in [7-9]; they remove the need for
channel knowledge in the network and suit
situations in which the channel information is unavailable.
However, in a real-world wireless mobile network with
frequency offset caused by the Doppler shift (the change in
frequency observed when a node moves relative to the source),
differential detection may fail to attain the desired
performance. Moreover, due to the mobility of the nodes,
the oscillators of the source, relay(s) and destination
have difficulty synchronizing perfectly [10].
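Why single differential detection struggles under a carrier offset can be seen in a small noise-free sketch. The code below is illustrative only and is not taken from [7-10]; the helper names (`psk`, `sd_encode`, `channel`, `sd_decision`, `residual_rotation`) are hypothetical, and the channel is assumed flat with a constant offset w in radians per symbol.

```python
import cmath
import math


def psk(k, m=4):
    """k-th point of a unit-energy M-PSK constellation."""
    return cmath.exp(2j * math.pi * k / m)


def sd_encode(symbols):
    """Single differential encoding: s[n] = x[n] * s[n-1], with s[0] = 1."""
    s = [1 + 0j]
    for x in symbols:
        s.append(x * s[-1])
    return s


def channel(tx, h, w):
    """Noise-free flat channel with gain h and carrier offset w (rad/symbol)."""
    return [h * cmath.exp(1j * w * n) * t for n, t in enumerate(tx)]


def sd_decision(rx):
    """Decision variables r[n] * conj(r[n-1]) of a single differential detector."""
    return [rx[n] * rx[n - 1].conjugate() for n in range(1, len(rx))]


def residual_rotation(symbols, h, w):
    """Phase error left on each decision variable relative to the sent symbol."""
    d = sd_decision(channel(sd_encode(symbols), h, w))
    return [cmath.phase(dn * x.conjugate()) for dn, x in zip(d, symbols)]


if __name__ == "__main__":
    data = [psk(k) for k in (0, 1, 3, 2, 1)]
    print(residual_rotation(data, h=0.7 * cmath.exp(0.4j), w=0.9))
```

The unknown channel phase cancels, but every decision variable stays rotated by w. Once w exceeds half the angular spacing of the constellation (π/4 for QPSK), symbol decisions become wrong even without noise.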

Generally, there are two major approaches to dealing with
frequency offset in wireless communication
systems. One approach is to design estimators for the
frequency offset; the estimated offset is then compensated
by means of frequency acquisition and a tracking circuit
[11]. These estimators, however, significantly reduce the
data rate owing to the transmission of pilot
symbols as a reference in the estimation process, and
increase the computational complexity. Thus, Double-Differential
(DD) transmission is devised in [12] to remove
the frequency offset effects. Another proposed approach is
the Multiple Symbol Differential Detection (MSDD)
demonstrated in [13]. This manuscript aims to provide a
survey of receiver designs in cooperative diversity
systems that alleviate the frequency offset problem.

The manuscript is structured as follows. Section II gives an
overview of the DD transmission scheme. Section III presents
the MSDD technique, which operates in the presence of carrier
offset without channel knowledge, as incorporated in
cooperative diversity systems.
Finally, conclusions and future directions are drawn.

II. The Double Differential Transmission

The DD transmission approach is one of the techniques
employed to mitigate the performance degradation in the
presence of frequency offsets. Fig. 2a illustrates the DD
encoder block, wherein x[n] represents the symbols of an
M-ary Phase Shift Keying (M-PSK) constellation at time n and D
denotes the delay. The transmitted signal z[n] at time n is
given by $z[n] = x[n]\,y[n-1]\,z[n-1]$. For a point-to-point
wireless communication link with carrier offset, the
received signal is

$r[n] = h\,e^{jwn}\,z[n] + e[n]$

as illustrated in Fig. 1(b), where h represents the channel gain,
w is the unknown carrier offset and e[n] denotes the
Additive White Gaussian Noise (AWGN).

The decision variable for x[n] is $p[n]\,p^*[n-1]$,
where $p[n] = r[n]\,r^*[n-1]$.
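The encoder and decision variable described above can be sketched in a few lines. This is a minimal, noise-free illustration rather than the implementation of [12] or [14]: it assumes the intermediate symbol obeys y[n] = x[n] y[n-1], so that z[n] = y[n] z[n-1] = x[n] y[n-1] z[n-1], and that two reference symbols equal to 1 precede the data. The function names are hypothetical.

```python
import cmath
import math


def psk(k, m):
    """k-th point of a unit-energy M-PSK constellation."""
    return cmath.exp(2j * math.pi * k / m)


def dd_encode(indices, m):
    """Double differential encoder for a list of M-PSK symbol indices."""
    y = z = 1 + 0j
    tx = [z, z]                      # two reference symbols (assumption)
    for k in indices:
        y = psk(k, m) * y            # first stage:  y[n] = x[n] y[n-1]
        z = y * z                    # second stage: z[n] = y[n] z[n-1]
        tx.append(z)
    return tx


def apply_channel(tx, h, w):
    """r[n] = h exp(jwn) z[n], without noise."""
    return [h * cmath.exp(1j * w * n) * t for n, t in enumerate(tx)]


def dd_decode(rx, m):
    """Form p[n] = r[n] r*[n-1], then p[n] p*[n-1], and pick the nearest phase."""
    p = [rx[n] * rx[n - 1].conjugate() for n in range(1, len(rx))]
    d = [p[n] * p[n - 1].conjugate() for n in range(1, len(p))]
    step = 2 * math.pi / m
    return [round(cmath.phase(dn) / step) % m for dn in d]


if __name__ == "__main__":
    sent = [0, 1, 3, 2, 2, 1, 0, 3]
    rx = apply_channel(dd_encode(sent, 4), h=0.6 * cmath.exp(1.3j), w=0.8)
    print(dd_decode(rx, 4) == sent)
```

The first product removes the channel phase and turns the linearly growing offset phase into a constant factor exp(jw); the second product cancels that constant, so neither h nor w has to be estimated.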




Figure 2: A Block Diagram of Double Differential (DD)
(a) Modulation and (b) Demodulation [14]

By employing this approach, the information can be
detected from the previously received
consecutive symbols, which bypasses the need for knowledge
of the unknown channel coefficients. This approach is
investigated in [14] for the OWRN cooperative diversity system
illustrated in Fig. 3. In this scheme, the AF protocol is
employed in the presence of random frequency offsets over
block-fading Nakagami-m channels which remain static for at
least three sequential time intervals. Referring to Fig. 3, the
information signal is encoded using DDM (i.e. DD
modulation) before being broadcast to the relay and
directly towards the destination in a half-duplex
manner.

During the first phase, the signal received at the relay is
described by:

$x_{s,r}(n) = \sqrt{P_1}\,h_{s,r}\,e^{j w_{s,r} n}\,z(n) + e_{s,r}(n), \quad n = 0, 1, \ldots$ (1)

and the signal received at the destination is written as:

$x_{s,d}(n) = \sqrt{P_1}\,h_{s,d}\,e^{j w_{s,d} n}\,z(n) + e_{s,d}(n), \quad n = 0, 1, \ldots$ (2)

where $P_1$ denotes the power transmitted by the source, $h_{s,r}$
and $h_{s,d}$ are the channel gains, $w_{s,r}$ and $w_{s,d}$ are the carrier
offsets, and $e_{s,r}$ and $e_{s,d}$ are the Additive White Gaussian
Noise (AWGN) on the source-relay and source-destination
links, respectively.
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Figure 3 : Block Diagram of the DD transmission. [14] 



During the second phase, the source remains idle and the
relay amplifies the signal $x_{s,r}$ in (1) and retransmits it
towards the destination as:

$x_{r,d}(l) = \sqrt{P_2}\,h_{r,d}\,e^{j w_{r,d} l}\,x_{s,r}(l) + e_{r,d}(l), \quad l = 0, 1, \ldots$ (3)

with l representing the time index that reflects the time
difference between the first and second phases; $P_2$ is the
amplification factor at the relay, $h_{r,d}$ is the channel gain, $w_{r,d}$
is the frequency offset and $e_{r,d}$ is the AWGN on the
relay-destination link.
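The two-phase signal model of (1)-(3) can be simulated with a short sketch. The code is an illustration under stated assumptions, not the simulation setup of [14]: the time indices are aligned (l = n), the channels are fixed rather than Nakagami-m distributed, all links share the noise variance `sigma2`, and the names `awgn`, `link` and `owrn_af` are hypothetical.

```python
import cmath
import math
import random


def awgn(sigma2, rng):
    """One complex Gaussian noise sample with total variance sigma2."""
    s = math.sqrt(sigma2 / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def link(tx, gain, h, w, sigma2, rng):
    """Output gain * h * exp(j*w*n) * tx[n] + noise for one link."""
    return [gain * h * cmath.exp(1j * w * n) * t + awgn(sigma2, rng)
            for n, t in enumerate(tx)]


def owrn_af(z, P1, P2, h_sr, h_sd, h_rd, w_sr, w_sd, w_rd, sigma2, seed=0):
    """Return the direct and relayed signals seen by the destination."""
    rng = random.Random(seed)
    x_sr = link(z, math.sqrt(P1), h_sr, w_sr, sigma2, rng)      # eq. (1)
    x_sd = link(z, math.sqrt(P1), h_sd, w_sd, sigma2, rng)      # eq. (2)
    x_rd = link(x_sr, math.sqrt(P2), h_rd, w_rd, sigma2, rng)   # eq. (3)
    return x_sd, x_rd


if __name__ == "__main__":
    z = [1 + 0j, 1j, -1 + 0j, -1j]
    x_sd, x_rd = owrn_af(z, P1=1.0, P2=1.0, h_sr=0.9, h_sd=0.5, h_rd=0.8,
                         w_sr=0.2, w_sd=0.1, w_rd=0.3, sigma2=0.01)
    print(x_sd[:2], x_rd[:2])
```

Because the relay forwards its noisy observation, the relayed branch carries the source-relay noise scaled by the relay-destination link in addition to its own noise, and its effective carrier offset is the sum of the two link offsets.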

The relayed and directly transmitted signals are combined
using the Maximal Ratio Combining (MRC) [15]
scheme, and the decision variable q(n) based on the
DD demodulator shown in Fig. 3 is proposed as:

$q(n) = A_1\,(r_{s,d}[n]\,r^*_{s,d}[n-1])\,(r_{s,d}[n-1]\,r^*_{s,d}[n-2])^* + A_2\,(r_{r,d}[l]\,r^*_{r,d}[l-1])\,(r_{r,d}[l-1]\,r^*_{r,d}[l-2])^*$ (4)

where $r_{s,d}$ and $r_{r,d}$ denote the signals received at the
destination over the direct and relayed links, k = n = l, and
$A_1$ and $A_2$ are formulated as:

$A_1 = \frac{1}{(2 P_1 |h_{s,d}|^2 + \sigma^2)\,\sigma^2}$ (5)

$A_2 = \frac{(P_1 \sigma_{s,r}^2 + \sigma^2)^2}{K}$ (6)

where,

$K = 2 P_1 P_2^2 |h_{r,d}|^4 |h_{s,r}|^2 \sigma^2 + 2 P_1 P_2 (P_1 \sigma_{s,r}^2 + \sigma^2) |h_{r,d}|^2 |h_{s,r}|^2 \sigma^2 + P_2^2 |h_{r,d}|^4 \sigma^4 + 2 P_2 (P_1 \sigma_{s,r}^2 + \sigma^2) |h_{r,d}|^2 \sigma^4 + (P_1 \sigma_{s,r}^2 + \sigma^2)^2 \sigma^4$ (7)

From Eq. 7, in order to eliminate the channel knowledge, the
MRC scheme replaces the channel coefficients with the
channel variances. Then, the signal is decoded using the
Maximum Likelihood (ML) approach and can be written as:

$\hat{x}(n) = \arg\max_{x \in E} \mathrm{Re}\{q[k]\,x^*\}$ (8)

The Bit Error Rate (BER) analysis provided reveals
that this MRC performs worse than the ideal
MRC in [15]. Thus, the use of the instantaneous SNR is suggested
and has been shown to perform close to the ideal
MRC scheme. In addition, the proposed scheme is able to
predict the behaviour of the cooperative system without
explicit knowledge of the channel and frequency
offset. The concept of DD transmission is also implemented
in [13-15]. Cano et al. in [16] suggested
DD modulation at the relays and a simple heuristic detector in
a distributed network, as shown in Fig. 4.
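The combining and detection steps of (4)-(8) can be illustrated as follows. This sketch rests on a reading of the combining weights as $A_1 = 1/(\sigma^2 (2 P_1 |h_{s,d}|^2 + \sigma^2))$ and $A_2 = (P_1 \sigma_{s,r}^2 + \sigma^2)^2 / K$, which is an assumption given how the equations are typeset in the source; the function names are hypothetical, and the arguments `g_sd`, `g_sr`, `g_rd` stand for channel power gains (or the variances that replace them).

```python
import cmath
import math


def mrc_weights(P1, P2, g_sd, g_sr, g_rd, var_sr, sigma2):
    """Weights A1 and A2 for the direct and relayed branches (assumed form)."""
    D = P1 * var_sr + sigma2
    A1 = 1.0 / (sigma2 * (2 * P1 * g_sd + sigma2))
    K = (2 * P1 * P2**2 * g_rd**2 * g_sr * sigma2
         + 2 * P1 * P2 * D * g_rd * g_sr * sigma2
         + P2**2 * g_rd**2 * sigma2**2
         + 2 * P2 * D * g_rd * sigma2**2
         + D**2 * sigma2**2)
    return A1, D**2 / K


def dd_products(r):
    """(r[n] r*[n-1]) (r[n-1] r*[n-2])* for every n >= 2."""
    p = [r[n] * r[n - 1].conjugate() for n in range(1, len(r))]
    return [p[n] * p[n - 1].conjugate() for n in range(1, len(p))]


def combine_and_detect(r_sd, r_rd, A1, A2, m):
    """Form q as in (4), then pick the M-PSK index maximizing Re{q x*} as in (8)."""
    points = [cmath.exp(2j * math.pi * k / m) for k in range(m)]
    decisions = []
    for d_sd, d_rd in zip(dd_products(r_sd), dd_products(r_rd)):
        q = A1 * d_sd + A2 * d_rd
        decisions.append(max(range(m),
                             key=lambda k: (q * points[k].conjugate()).real))
    return decisions
```

Each branch is weighted by the inverse of an approximate noise variance of its double differential product, so the less noisy branch dominates the sum before the ML decision is taken.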




Figure 4 : Proposed DD Transmission System Model [16] 
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Refer to Fig. 4 the relay receives a block of K - 2 symbols 
of information s. Each element of s is then mapped to a
unitary diagonal matrix $v_k$. A recursion is
applied to $v_k$ to obtain the double-differentially encoded
block signals $x_k$. The symbols $x_k$ are then interleaved to
form X and processed through a multiplexing matrix. At the
destination, the heuristic detector is given by:

v k = argmin || Y ^ (y k y k . x "''J'hJ'w) f (9) 



The said scheme, however, requires prior knowledge of the
sources and a transmission ordering protocol¹ to ensure
that each source acknowledges the corresponding relay.
Hence, it may be inefficient in an ad-hoc network, since
sources may join or leave the network [16].

DD transmission in a TWRN-based system is further
investigated in [17] using the AF relaying protocol. Referring to
Fig. 1(b), the relay first transmits a
sequence of signals to both users so that each user can
estimate its own channel gain before the transmission
process. In Phase 1, both User 1 and User 2 transmit
double-differentially encoded signals to the relay in the
presence of Doppler effects. At the relay, the received
signals are converted from the carrier frequency to baseband.
During Phase 2, the relay amplifies the complex
conjugate of the combined received signal by a
multiplication factor. The resulting signal is then converted
back to the carrier frequency and broadcast to the desired
destinations. The signal received by each user contains
self-interference as well as the desired signal from its
counterpart. Using DD demodulation, the information
symbols of the user can be decoded after the
self-interference² cancellation process. The average Symbol
Error Rate (SER) shows that the proposed scheme
performs better than the OWRN with the
assumption that the frequency offset is perfectly estimated
and compensated. However, the SER performance degrades
when the number of pilot symbols is
reduced, and the extra overhead reduces the available
bandwidth. Furthermore, the scheme assumes perfect
self-interference cancellation at the destination, which
is impractical in real applications [16].

III. Multiple Symbol Differential Detection

The Multiple Symbol Differential Detection (MSDD) scheme
uses more than two consecutive symbols, rather than
symbol-by-symbol detection, to detect the transmitted symbols
at the destination.

Wu et al. [19] presented two schemes, namely Single
Differential (SD) and Double Differential (DD) based
detection, employing AF over flat-fading channels in a TWRN
system in the presence of frequency offset. For SD, a
three-stage approach as shown in Fig. 5(a) is carried out.



¹ The transmission order protocol determines the order in
which nodes act as online transmitters: the source
node is the first of the N + 1 network nodes in the order, which
ends with the destination node [21].

² Self-interference occurs when an unwanted signal is
received at the destination over a transmission link from the
source.



Figure 5: Multiple Symbol Differential Detection based on
(a) Symbol-by-Symbol Single Differential Detection and
(b) Multiple-Symbol Double Differential Detection

With the SD detection illustrated in Fig. 5, the
self-information of each source node is removed upon
receiving the superimposed information broadcast from
the relay by estimating the channel knowledge (i.e., the
amplification factor and channel gains). In the following
stage, the frequency offset is estimated
by transmitting only two training symbols for each
information frame. The effect of the frequency offset is then
compensated using this estimate. The
received symbols are detected using the Generalized
Likelihood Ratio Test (GLRT)-based algorithm³.
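How two training symbols can yield a frequency offset estimate is sketched below. This is a generic phase-difference estimator shown for illustration, not necessarily the estimator used in [19]; it assumes unit-modulus training symbols, a flat channel, an offset magnitude below π radians per symbol, and no noise. The names `estimate_offset` and `compensate` are hypothetical.

```python
import cmath


def estimate_offset(r0, r1, t0, t1):
    """Estimate the offset w from two consecutive received training symbols.

    r1 * conj(r0) equals |h|^2 * exp(jw) * t1 * conj(t0) without noise,
    so removing the known training phase leaves exp(jw).
    """
    known = t1 * t0.conjugate()
    return cmath.phase(r1 * r0.conjugate() * known.conjugate())


def compensate(rx, w_hat, start=0):
    """Undo the linearly growing phase exp(j * w_hat * n) on a frame."""
    return [r * cmath.exp(-1j * w_hat * (start + n)) for n, r in enumerate(rx)]


if __name__ == "__main__":
    h, w = 0.9 * cmath.exp(0.5j), 0.35
    frame = [1 + 0j, 1j, 1 + 0j, -1 + 0j, 1j, -1j]   # two training symbols first
    rx = [h * cmath.exp(1j * w * n) * s for n, s in enumerate(frame)]
    w_hat = estimate_offset(rx[0], rx[1], frame[0], frame[1])
    print(w_hat, compensate(rx, w_hat)[:3])
```

With noise, a two-symbol estimate is coarse, which is one reason the DD alternative that avoids estimation altogether is attractive.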

For Double Differential detection, a two-stage mechanism as
depicted in Fig. 5(b) is proposed, wherein the first stage
performs single-symbol differential detection as well as
self-information subtraction, bypassing channel estimation.
For the second stage, a classical DD detector is implemented
because it avoids the highly complex frequency offset
estimation, compensation and tracking circuitry. The
detection mechanism can be executed using symbol-by-symbol or
multiple-symbol detection. Multiple-symbol DD detection is
attractive because, compared to conventional DD detection, it
recovers the desired information without requiring a higher
SNR to obtain the same average BER performance as its coherent
detection counterpart [19,20]. Double
differential multiple-symbol detection can be obtained on
a single differential basis. The received signal at either
source node $S_1$ or $S_2$, with no additive noise, is obtained
as:

A (0 = y, ( t)y ; (f-1) =1 V I 2 e^p 2 (t - l)z 2 ( t ) 

= hp 2 (t-\)z 2 (t) (10) 



³ In GLRT detection, the likelihood function is estimated or
maximized over an unknown parameter (i.e. the channel
gains) in order to obtain the decision metrics [22].
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